Abstract

This work shows how to leverage causal inference to understand the behavior of complex
learning systems interacting with their environment and predict the consequences
of changes to the system. Such predictions allow both humans and algorithms to select
changes that improve both the short-term and long-term performance of such systems.
This work is illustrated by experiments carried out on the ad placement system associated
with the Bing search engine.

1. Introduction 

Statistical machine learning technologies in the real world are never without a purpose. 
Using their predictions, humans or machines make decisions whose circuitous consequences 
often violate the assumptions that justified the statistical approach in the first place. 

• Consider for instance the detection of fraudulent credit card transactions (e.g., Hand
and Weston, 2008). Informed by the outputs of a statistical system, the card operator
makes decisions such as declining a transaction. These decisions have an immediate
impact on the card operator earnings. They also affect the satisfaction of both the
merchant and the customer and therefore impact the future earnings of the card 
operator. The card operator decisions can also change the behavior of the fraudsters. 
Finally the card operator decisions change the data that are collected and used to 
train or update the fraud detection system itself. 

• Consider the placement of advertisements on the result pages of Internet search engines 
(e.g., Edelman et al., 2007). The placement decisions depend on the bids of the 
advertisers and on scores computed by statistical machine learning systems. Because 
these decisions define the contents of the result page proposed to the user, they directly 
influence both the occurrence of clicks and the corresponding advertiser payments.
The placement decisions also impact the user satisfaction with this search engine. 
Meanwhile the future bids of advertisers depend on how much value they see in the 
Internet traffic they receive because of their advertisements. Finally the placement 
decisions affect the collection of potential training data. 

Meanwhile the designer of the learning system faces a different set of questions: Is it useful 
to pass a new input signal to the statistical model? Is it worthwhile to collect and label a 
new training set? What about changing the loss function or the learning algorithm? In order 
to answer such questions and improve the operational performance of the learning system, 
one needs to unravel how the information produced by the statistical models traverses this 
web of causes and consequences and produces measurable losses and rewards. 

We can phrase similar questions about the parameters of the model. Is it worthwhile 
to move the parameters along this specific direction? Is it worthwhile to replace them by 
this new value? An automated way to generate and answer such questions immediately 
leads to a learning algorithm. Therefore, in order to design a learning algorithm, one also 
needs to understand how the graph of causes and consequences maps the model outputs 
into measurable losses and rewards. 

This work assumes that we have some knowledge of the structure of the causal graph 
and describes how to address these questions using the principles of causal inference. The 
tools described in the following sections are closely related to methods of reinforcement 
learning (Sutton and Barto, 1998) and methods proposed for various special cases such as
multi-armed bandit (Robbins, 1952) and contextual bandit (Langford and Zhang, 2008)
problems (see also the bibliographical notes, appendix A.5). The fundamental contribution
of this work is to demonstrate the qualitative and quantitative benefits afforded by paying
closer attention to the detailed structure of the graph of causes and consequences.

This paper is structured as follows: 

• Section 2 gives an overview of the advertisement placement problem which serves as 
our main example. In particular, we stress some of the difficulties encountered when 
one approaches such a problem without a principled perspective. 

• Section 3 provides a condensed review of the essential concepts of causal modeling 
and inference. Special attention is paid to the isolation assumption, which allows us
to interpret the data as repeated independent trials amenable to statistical analysis. 

• Section 4 centers on formulating and answering questions of the form: how would 
the system have performed during the data collection period if certain interventions 
had been carried out on the system? This process is called counterfactual analysis 
because such questions pertain to events that did not happen but could have happened. 
We describe importance sampling methods for counterfactual analysis, with clear 
conditions of validity and confidence intervals. 

• Section 5 describes useful importance sampling techniques, including techniques to 
reduce the variance and improve confidence intervals, and techniques to estimate 
derivatives. 

• Section 6 describes how counterfactual analysis provides the essential signal for the
design of learning algorithms. Assume that we have identified specific interventions
that would have caused the system to perform well during the data collection period.
What guarantees can we obtain on the performance of these same interventions in the
future?

• Section 7 presents counterfactual differential techniques for the study of equilibria.
Using data collected when the system is at equilibrium, we can estimate how small
interventions change the equilibrium point. This provides an elegant and effective way
to reason about long-term feedback effects.

This work does not discuss learning algorithms but deals with the identification and the 
measurement of interpretable signals that justify the actions of humans and machines alike. 
Whether these signals are exploited by human decision makers or by machine learning 
algorithms is marginally relevant to our approach. Since real world learning systems often 
involve a mixture of human decision and automated processes, it makes sense to separate the 
discussion of the learning signals from the discussion of the learning algorithms that leverage 
them. This is not a new idea. Wiener (1948) argues that the study of the propagation of 
learning signals constitutes the discipline that he calls cybernetics. 

2. Difficulties 

After giving an overview of the advertisement placement problem, which serves as our main 
example in this work, this section illustrates some of the difficulties that arise when one 
does not pay sufficient attention to the causal structure of the learning system. 

2.1 Advertisement Placement 

All Internet users are now familiar with the advertisement messages that adorn popular 
web pages. Advertisements are particularly effective on search engine result pages because 
users who are searching for something are good targets for advertisers who have something 
to offer. Several actors take part in this Internet advertisement game: 

• Advertisers create advertisement messages, and place bids that describe how much 
they are willing to pay to see their ads displayed or clicked. 

• Publishers provide attractive web services, such as, for instance, an Internet search 
engine. They display selected ads and expect to receive payments from the advertisers. 
The infrastructure to collect the advertiser bids and select ads is sometimes provided 
by an advertising network on behalf of its affiliated publishers. For the purposes of 
this work, we simply consider a publisher large enough to run its own infrastructure. 

• Users reveal information about their current interests, for instance, by entering a 
query in a search engine. They are offered web pages containing a selection of ads 
(figure 1). Users sometimes click on an advertisement and are transported to a web 
site controlled by the advertiser where they can initiate some business. 

A conventional bidding language is necessary to precisely define under which conditions an
advertiser is willing to pay the bid amount. In the case of Internet search advertisement,
each bid specifies (a) the advertisement message, (b) a set of keywords, (c) one of several
possible matching criteria between the keywords and the user query, and (d) the maximal
price the advertiser is willing to pay when a user clicks on the ad after entering a query
that matches the keywords according to the specified criterion.

Figure 1: Mainline and sidebar ads on a search result page. Ads placed in the mainline
are more likely to be noticed, increasing both the chances of a click if the ad is
relevant and the risk of annoying the user if the ad is not relevant.

Whenever a user visits a publisher web page, an advertisement placement engine runs 
an auction in real time in order to select winning ads, determine where to display them 
in the page, and compute the prices charged to advertisers, should the user click on their 
ad. Since the placement engine is operated by the publisher, it is designed to further the 
interests of the publisher. Fortunately for everyone else, the publisher must balance short 
term interests, namely the immediate revenue brought by the ads displayed on each web 
page, and long term interests, namely the future revenues resulting from the continued 
satisfaction of both users and advertisers. 

Auction theory explains how to design a mechanism that optimizes the revenue of the 
seller of a single object (Myerson, 1981; Milgrom, 2004) under various assumptions about the 
information available to the buyers regarding the intentions of the other buyers. In the case 
of the ad placement problem, the publisher runs multiple auctions and sells opportunities 
to receive a click. When nearly identical auctions occur thousands of times per second,
it is tempting to consider that the advertisers have perfect information about each other. 
This assumption gives support to the popular generalized second price rank-score auction 
(Varian, 2007; Edelman et al., 2007): 

• Let $x$ represent the auction context information, such as the user query, the user
profile, the date, the time, etc. The ad placement engine first determines all eligible
ads $a_1 \dots a_n$ and the corresponding bids $b_1 \dots b_n$ on the basis of the auction context
$x$ and of the matching criteria specified by the advertisers.

• For each selected ad $a_i$ and each potential position $p$ on the web page, a statistical
model outputs the estimate $q_{i,p}(x)$ of the probability that ad $a_i$ displayed in position $p$
receives a user click. The rank-score $r_{i,p}(x) = b_i \, q_{i,p}(x)$ then represents the purported
value associated with placing ad $a_i$ at position $p$.

• Let $L$ represent a possible ad layout, that is, a set of positions that can simultaneously
be populated with ads, and let $\mathcal{L}$ be the set of possible ad layouts, including of course
the empty layout. The optimal layout and the corresponding ads are obtained by
maximizing the total rank-score

$$ \max_{L \in \mathcal{L}} \; \max_{i_1, i_2, \dots} \; \sum_{p \in L} r_{i_p,p}(x) , \tag{1} $$

subject to reserve constraints

$$ \forall p \in L, \quad r_{i_p,p}(x) \geq R_p(x) , \tag{2} $$

and also subject to diverse policy constraints, such as, for instance, preventing the
simultaneous display of multiple ads belonging to the same advertiser. Under mild
assumptions, this discrete maximization problem is amenable to computationally
efficient greedy algorithms (see appendix A.1).

• The advertiser payment associated with a user click is computed using the generalized 
second price (GSP) rule: the advertiser pays the smallest bid that it could have entered 
without changing the solution of the discrete maximization problem, all other bids 
remaining equal. In other words, the advertiser could not have manipulated its bid 
and obtained the same treatment for a better price. 
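
The following Python sketch illustrates the rank-score ordering and the GSP pricing rule in a deliberately simplified setting. It is not the placement engine described here. It assumes a single column of `num_slots` positions, a click probability estimate that does not depend on the position, and one scalar reserve on the rank-score. The names `Ad` and `run_auction` are hypothetical, and ties and policy constraints are ignored.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid: float  # maximal price per click, b_i
    q: float    # estimated click probability, q_i (assumed position independent)


def run_auction(ads, num_slots, reserve):
    """Return [(ad name, price per click)] for the filled slots, top slot first."""
    # Keep the ads whose rank-score b_i * q_i clears the reserve, best first.
    eligible = sorted(
        (ad for ad in ads if ad.bid * ad.q >= reserve),
        key=lambda ad: ad.bid * ad.q,
        reverse=True,
    )
    results = []
    for k, ad in enumerate(eligible[:num_slots]):
        # Rank-score this ad must still reach: the next eligible ad's, or the reserve.
        if k + 1 < len(eligible):
            next_score = eligible[k + 1].bid * eligible[k + 1].q
        else:
            next_score = reserve
        threshold = max(next_score, reserve)
        # GSP rule: smallest bid that would have kept the ad in the same slot.
        results.append((ad.name, threshold / ad.q))
    return results


if __name__ == "__main__":
    ads = [Ad("A", 2.0, 0.10), Ad("B", 1.0, 0.15), Ad("C", 3.0, 0.02)]
    for name, price in run_auction(ads, num_slots=2, reserve=0.05):
        print(name, round(price, 3))
```

In this toy setting each winner pays no more than its own bid, because its rank-score is at least as large as the threshold it is priced against.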

Under the perfect information assumption, the analysis suggests that the publisher simply
needs to find which reserve prices $R_p(x)$ yield the best revenue per auction. However,
the total revenue of the publisher also depends on the traffic experienced by its web site.
Displaying excessive numbers of irrelevant ads can train users to ignore the ads, and can
also drive them to competing web sites. Advertisers can artificially raise the rank-scores of
irrelevant ads by temporarily increasing the bids. Indelicate advertisers can create deceptive
advertisement messages that elicit many clicks but direct users to spam web sites. Experience
shows that the continued satisfaction of the users is more important to the publisher
than it is to the advertisers.

Therefore the generalized second price rank-score auction has evolved. Rank-scores have 
been augmented with terms that quantify the user satisfaction or the ad relevance. Bids 
receive adaptive discounts in order to deal with situations where the perfect information
assumption is unrealistic. These adjustments are driven by additional statistical models. The
ad placement engine should therefore be viewed as a complex learning system interacting 
with both users and advertisers. 

2.2 Controlled Experiments 

The designer of such an ad placement engine faces the fundamental question of testing 
whether a proposed modification of the ad placement engine results in an improvement of 
the operational performance of the system. 

The simplest way to answer such a question is to try the modification. The basic idea is 
to randomly split the users into treatment and control groups (Kohavi et al., 2008). Users 
from the control group see web pages generated using the unmodified system. Users of the
treatment groups see web pages generated using alternate versions of the system. Monitoring
various performance metrics for a couple of months usually gives sufficient information to
reliably decide which variant of the system delivers the most satisfactory performance.
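
As an illustration of the random split, the sketch below assigns users to groups by hashing a user identifier together with an experiment name. This is only one common way to obtain a stable pseudo-random split and is an assumption made here, not a description of how the experiments in this work were run. The function name `assign_group` and the identifiers are hypothetical.

```python
import hashlib


def assign_group(user_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
    """Map a user to 'treatment' or 'control', always the same way for a given experiment."""
    # Hashing the pair (experiment, user) gives a split that is stable across
    # visits and unrelated to the splits used by other experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"


if __name__ == "__main__":
    groups = [assign_group(f"user-{i}", "new-click-model") for i in range(10_000)]
    print(groups.count("treatment") / len(groups))
```

Because the assignment depends only on the identifier and the experiment name, a returning user keeps seeing the same variant for the whole monitoring period.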

Modifying an advertisement placement engine elicits reactions from both the users and 
the advertisers. Whereas it is easy to split users into treatment and control groups, splitting 
advertisers into treatment and control groups demands special attention because each auction
involves multiple advertisers (Charles et al., 2012). Simultaneously controlling for both users 
and advertisers is probably impossible. 

Controlled experiments also suffer from several drawbacks. They are expensive because 
they demand a complete implementation of the proposed modifications. They are slow 
because each experiment typically demands a couple of months. Finally, although there are
elegant ways to efficiently run overlapping controlled experiments on the same traffic (Tang 
et al., 2010), they are limited by the volume of traffic available for experimentation. 

It is therefore difficult to rely on controlled experiments during the conception phase of 
potential improvements to the ad placement engine. It is similarly difficult to use controlled 
experiments to drive the training algorithms associated with click probability estimation 
models. Cheaper and faster statistical methods are needed to drive these essential aspects 
of the development of an ad placement engine. Unfortunately, interpreting cheap and fast 
data can be very deceiving. 

2.3 Confounding Data 

Assessing the consequence of an intervention using statistical data is generally challenging 
because it is often difficult to determine whether the observed effect is a simple consequence 
of the intervention or has other uncontrolled causes. 

For instance, the empirical comparison of certain kidney stone treatments illustrates 
this difficulty (Charig et al., 1986). Table 1 reports the success rates observed on two 
groups of 350 patients treated with respectively open surgery (treatment A, with 78% 
success) and percutaneous nephrolithotomy (treatment B, with 83% success). Although 
treatment B seems more successful, it was more frequently prescribed to patients suffering 
from small kidney stones, a less serious condition. Did treatment B achieve a high success 
rate because of its intrinsic qualities or because it was preferentially applied to less severe 
cases? Further splitting the data according to the size of the kidney stones reverses the 
conclusion: treatment A now achieves the best success rate for both patients suffering from 
large kidney stones and patients suffering from small kidney stones. Such an inversion of 
the conclusion is called Simpson's paradox (Simpson, 1951). 

The stone size in this study is an example of a confounding variable, that is, an uncontrolled
variable whose consequences pollute the effect of the intervention. Doctors knew
the size of the kidney stones, chose to treat the healthier patients with the least invasive 
treatment B, and therefore caused treatment B to appear more effective than it actually 
was. If we now decide to apply treatment B to all patients irrespective of the stone size, we 
break the causal path connecting the stone size to the outcome, we eliminate the illusion, 
and we will experience disappointing results. 
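
The reversal can be checked directly from the counts reported in Table 1. The Python sketch below recomputes the overall and per-stratum success rates, then estimates the success rate that each treatment would achieve if it were applied to all 700 patients. That last estimate weights each stratum by its share of the whole population and rests on the assumption, made here for illustration only, that stone size is the sole confounding variable. The function names are hypothetical.

```python
# (successes, patients) per treatment and stone size, from Table 1.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, patients):
    return successes / patients


def overall(treatment):
    """Pooled (successes, patients) for one treatment, ignoring stone size."""
    strata = counts[treatment].values()
    return sum(s for s, _ in strata), sum(n for _, n in strata)


def adjusted(treatment):
    """Success rate expected if every patient received `treatment`,
    assuming stone size is the only confounding variable."""
    sizes = ("small", "large")
    n_size = {size: sum(counts[t][size][1] for t in counts) for size in sizes}
    n_all = sum(n_size.values())
    return sum(n_size[size] / n_all * rate(*counts[treatment][size]) for size in sizes)


if __name__ == "__main__":
    for t in ("A", "B"):
        print(t, f"overall {rate(*overall(t)):.1%}", f"adjusted {adjusted(t):.1%}")
```

Under that assumption the adjusted estimate favors treatment A, in agreement with the per-stratum comparison and in disagreement with the pooled counts.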

Table 1: A classic example of Simpson's paradox. The table reports the success rates of
two treatments for kidney stones (Charig et al., 1986, tables I and II). Although
the overall success rate of treatment B seems better, treatment B performs worse
than treatment A on both patients with small kidney stones and patients with
large kidney stones. See section 2.3.

                                             Overall          Patients with    Patients with
                                                              small stones     large stones
Treatment A: Open surgery                    78% (273/350)    93% (81/87)      73% (192/263)
Treatment B: Percutaneous nephrolithotomy    83% (289/350)    87% (234/270)    69% (55/80)

When we suspect the existence of a confounding variable, we can split the contingency 
tables and reach improved conclusions. Unfortunately we cannot fully trust these conclusions
unless we are certain to have taken into account all confounding variables. The real
problem therefore comes from the confounding variables we do not know. 

Randomized experiments arguably provide the only correct solution to this problem (see 
Stigler, 1992). The idea is to randomly choose whether the patient receives treatment A or
treatment B. Because this random choice is independent from all the potential confounding
variables, known and unknown, they cannot pollute the observed effect of the treatments
(see also section 4.2). This is why controlled experiments in ad placement (section 2.2)
randomly distribute users between treatment and control groups, and this is also why, in
the case of an ad placement engine, we should be somewhat concerned by the practical
impossibility of randomly distributing both users and advertisers.
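
A small simulation can make this argument concrete. The Python sketch below draws synthetic patients and compares two assignment policies: a confounded one that favors treatment B for small stones, and a coin flip that ignores the stone size. All numbers are hypothetical. The success probabilities are loosely rounded from Table 1, and the 50% share of small stones and the 80%/20% prescription probabilities are invented for illustration.

```python
import random

# Hypothetical success probabilities per (treatment, stone size).
SUCCESS = {("A", "small"): 0.93, ("A", "large"): 0.73,
           ("B", "small"): 0.87, ("B", "large"): 0.69}


def simulate(assign, n, rng):
    """Observed success rate of each treatment under an assignment policy."""
    wins = {"A": 0, "B": 0}
    total = {"A": 0, "B": 0}
    for _ in range(n):
        size = "small" if rng.random() < 0.5 else "large"
        treatment = assign(size, rng)
        total[treatment] += 1
        wins[treatment] += rng.random() < SUCCESS[(treatment, size)]
    return {t: wins[t] / total[t] for t in wins}


def doctor(size, rng):
    # Confounded policy: treatment B is favored for small stones.
    p_b = 0.8 if size == "small" else 0.2
    return "B" if rng.random() < p_b else "A"


def coin_flip(size, rng):
    # Randomized policy: the stone size plays no role in the choice.
    return "B" if rng.random() < 0.5 else "A"


rng = random.Random(0)
observational = simulate(doctor, 200_000, rng)
randomized = simulate(coin_flip, 200_000, rng)
print("confounded:", observational)
print("randomized:", randomized)
```

With these made-up parameters the confounded policy makes treatment B look better, whereas the coin flip recovers the ordering built into the simulation, in which treatment A is better for both stone sizes.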

2.4 Confounding Data in Ad Placement 

Let us return to the question of assessing the value of passing a new input signal to the ad 
placement engine click prediction model. Section 2.1 outlines a placement method where 
the click probability estimates $q_{i,p}(x)$ depend on the ad and the position we consider, but
do not depend on other ads displayed on the page. We now consider replacing this model 
by a new model that additionally uses the estimated click probability of the top mainline 
ad to estimate the click probability of the second mainline ad (figure 1). We would like to 
estimate the effect of such an intervention using existing statistical data. 

We have collected ad placement data for Bing¹ search result pages served during three
consecutive hours on a certain slice of traffic. Let $q_1$ and $q_2$ denote the click probability
estimates computed by the existing model for respectively the top mainline ad and the
second mainline ad. After excluding pages displaying fewer than two mainline ads, we form
two groups of 2000 pages randomly picked among those satisfying the conditions $q_1 < 0.15$
for the first group and $q_1 \geq 0.15$ for the second group. Table 2 reports the click counts
and frequencies observed on the second mainline ad in each group. Although the overall
numbers show that users click more often on the second mainline ad when the top mainline
ad has a high click probability estimate $q_1$, this conclusion is reversed when we further split
the data according to the click probability estimate $q_2$ of the second mainline ad.

1. http://bing.com

Table 2: Confounding data in ad placement. The table reports the click-through rates and
the click counts of the second mainline ad. The overall counts suggest that the
click-through rate of the second mainline ad increases when the click probability
estimate $q_1$ of the top ad is high. However, if we further split the pages according
to the click probability estimate $q_2$ of the second mainline ad, we reach the opposite
conclusion. See section 2.4.

              Overall            $q_2$ low          $q_2$ high
$q_1$ low     6.2% (124/2000)    5.1% (92/1823)     18.1% (32/176)
$q_1$ high    7.5% (149/2000)    4.8% (71/1500)     15.6% (78/500)

Despite superficial similarities, this example is considerably more difficult to interpret 
than the kidney stone example. The overall click counts show that the actual click-through 
rate of the second mainline ad is positively correlated with the click probability estimate 
on the top mainline ad. Does this mean that we can increase the total number of clicks by 
placing regular ads below frequently clicked ads? 

Remember that the click probability estimates depend on the search query which itself 
depends on the user intention. The most likely explanation is that pages with a high $q_1$ are
frequently associated with more commercial searches and therefore receive more ad clicks
on all positions. The observed correlation occurs because the presence of a click and the
magnitude of the click probability estimate $q_1$ have a common cause: the user intention.
Meanwhile, the click probability estimate $q_2$ returned by the current model for the second
mainline ad also depends on the query and therefore the user intention. Therefore, assuming
that this dependence has comparable strength, and assuming that there are no other causal
paths, splitting the counts according to the magnitude of $q_2$ factors out the effects of this
common confounding cause. We then observe a negative correlation which now suggests 
that a frequently clicked top mainline ad has a negative impact on the click-through rate 
of the second mainline ad. 

If this is correct, we would probably increase the accuracy of the click prediction model 
by switching to the new model. This would decrease the click probability estimates for 
ads placed in the second mainline position on commercial search pages. These ads are 
then less likely to clear the reserve and therefore more likely to be displayed in the less 
attractive sidebar. The net result is probably a loss of clicks and a loss of money despite 
the higher quality of the click probability model. Although we could tune the reserve prices 
to compensate for this unfortunate effect, nothing in these data tells us where the performance
of the ad placement engine will land. Furthermore, unknown confounding variables might 
completely reverse our conclusions. 

Making sense out of such data is just too complex! 

2.5 A Better Way

It should now be obvious that we need a more principled way to reason about the effect 
of potential interventions. We provide one such more principled approach using the causal 
inference machinery (section 3). The next step is then the identification of a class of 
questions that are sufficiently expressive to guide the designer of a complex learning system, 
and sufficiently simple to be answered using data collected in the past using adequate 
procedures (section 4). 

A machine learning algorithm can then be viewed as an automated way to generate 
questions about the parameters of a statistical model, obtain the corresponding answers, 
and update the parameters accordingly (section 6). Learning algorithms derived in this 
manner are very flexible: human designers and machine learning algorithms can cooperate 
seamlessly because they rely on similar sources of information. 

3. Modeling Causal Systems 

When we point out a causal relationship between two events, we describe what we expect to 
happen to the event we call the effect, should an external operator manipulate the event we 
call the cause. Manipulability theories of causation (von Wright, 1971; Woodward, 2005) 
raise this commonsense insight to the status of a definition of the causal relation. Difficult 
adjustments are then needed to interpret statements involving causes that we can only 
observe through their effects, "because they love me," or that are not easily manipulated,
"because the earth is round." 

Modern statistical thinking makes a clear distinction between the statistical model and 
the world. The actual mechanisms underlying the data are considered unknown. The 
statistical models do not need to reproduce these mechanisms to emulate the observable 
data (e.g., Breiman, 2001). Better models are sometimes obtained by deliberately not
reproducing the true mechanisms (e.g., Vapnik, 1982, section 8.6). We can solve the
manipulability puzzle by viewing causation as a component of a reasoning model (Bottou, 
2011) rather than a property of the world. In this perspective, causes and effects are only 
the pieces of reasoning games played in our minds. What makes a collection of causal 
statements valid is simply the accuracy of the conclusions we reach when we reason about 
manipulations or interventions amenable to experimental validation. 

This section presents the rules of this reasoning game. We largely follow the framework 
proposed by Pearl (2009) because it gives a clear account of the connections between causal 
models and probabilistic models. 

3.1 The Flow of Information 

Figure 2 gives a deterministic description of the operation of the ad placement engine.
Variable $u$ represents the user and his or her intention in an unspecified manner. The
query and query context $x$ is then expressed as an unknown function of $u$ and of a
noise variable $\varepsilon_1$. Noise variables in this framework are best viewed as independent sources
of randomness useful for modeling a nondeterministic causal dependency. We shall only
mention them when they play a specific role in the discussion. The set of eligible ads $a$
and the corresponding bids $b$ are then derived from the query $x$ and the ad inventory $v$
supplied by the advertisers. Statistical models then compute a collection of scores $q$ such
as the click probability estimates $q_{i,p}$ and the reserves $R_p$ introduced in section 2.1. The
placement logic uses these scores to generate the "ad slate" $s$, that is, the set of winning
ads and their assigned positions. The corresponding click prices $c$ are computed. The set
of user clicks $y$ is expressed as an unknown function of the ad slate $s$ and the user intent $u$.
Finally the revenue $z$ is expressed as another function of the clicks $y$ and the prices $c$.

$$
\begin{aligned}
x &= f_1(u, \varepsilon_1) && \text{Query context $x$ from user intent $u$.} \\
a &= f_2(x, v, \varepsilon_2) && \text{Eligible ads $(a_i)$ from query $x$ and inventory $v$.} \\
b &= f_3(x, v, \varepsilon_3) && \text{Corresponding bids $(b_i)$.} \\
q &= f_4(x, a, \varepsilon_4) && \text{Scores from query $x$ and ads $a$.} \\
s &= f_5(a, q, b, \varepsilon_5) && \text{Ad slate $s$ from eligible ads $a$, scores $q$ and bids $b$.} \\
c &= f_6(a, q, b, \varepsilon_6) && \text{Corresponding click prices $c$.} \\
y &= f_7(s, u, \varepsilon_7) && \text{User clicks $y$ from ad slate $s$ and user intent $u$.} \\
z &= f_8(y, c, \varepsilon_8) && \text{Revenue $z$ from clicks $y$ and prices $c$.}
\end{aligned}
$$

Figure 2: A structural equation model for ad placement. The sequence of equations
describes the flow of information. The functions $f_k$ describe how effects depend
on their direct causes. The additional noise variables represent independent
sources of randomness useful to model probabilistic dependencies.

[Figure 3 graphic not reproduced. Surviving node labels: user intent $u$, query $x$, ad inventory $v$, clicks $y$, revenue $z$.]

Figure 3: Causal graph associated with the ad placement structural equation model
(figure 2). Nodes with yellow (resp. blue) background indicate bound variables with
known (resp. unknown) functional dependencies. The mutually independent noise
variables are implicit.

Such a system of equations is called a structural equation model. Each equation asserts
a functional dependency between an effect, appearing on the left hand side of the equation, 
and its direct causes, appearing on the right hand side as arguments of the function. Some 
of these causal dependencies are unknown. Although we postulate that the effect can be 
expressed as some function of its direct causes, we do not know the form of this function. 
For instance, the designer of the ad placement engine knows functions $f_2$ to $f_6$ and $f_8$
because he has designed them. However, he does not know the functions $f_1$ and $f_7$ because
whoever designed the user did not leave sufficient documentation. 

Figure 3 represents the directed causal graph associated with the structural equation
model. Each arrow connects a direct cause to its effect. The noise variables are omitted for 
simplicity. The structure of this graph reveals fundamental assumptions about our model. 
For instance, the user clicks y do not directly depend on the scores q or the prices c because 
users do not have access to this information. 

We hold as a principle that causation obeys the arrow of time: causes always precede 
their effects. Therefore the causal graph must be acyclic. Structural equation models then 
support two fundamental operations, namely simulation and intervention. 

• Simulation - Let us assume that we know both the exact form of all functional
dependencies and the value of all exogenous variables, that is, the variables that never
appear in the left hand side of an equation. We can compute the values of all the 
remaining variables by applying the equations in their natural time sequence. 

• Intervention - As long as the causal graph remains acyclic, we can construct derived 
structural equation models using arbitrary algebraic manipulations of the system of 
equations. For instance, we can clamp a variable to a constant value by rewriting the 
right-hand side of the corresponding equation as the specified constant value. 
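
The two operations can be sketched in a few lines of Python. The toy model below keeps only a handful of the variables of figure 2, and every functional form in it is a made-up placeholder: the true click function is precisely one of the unknowns discussed above. The helper names `simulate` and `intervene` are hypothetical.

```python
# Each equation maps an effect to (direct causes, function). The insertion
# order follows the natural time sequence, so one pass evaluates everything.
# All functional forms are invented placeholders for illustration.
model = {
    "x": (("u", "e1"), lambda u, e1: u + e1),        # query from user intent
    "q": (("x",), lambda x: 0.1 * x),                # score from query
    "s": (("q",), lambda q: 1 if q > 0.05 else 0),   # display the ad or not
    "y": (("s", "u", "e7"),
          lambda s, u, e7: int(s == 1 and e7 < 0.2 * u)),  # click
    "z": (("y",), lambda y: 0.5 * y),                # revenue
}


def simulate(model, exogenous):
    """Compute every bound variable from the values of the exogenous ones."""
    values = dict(exogenous)
    for effect, (causes, f) in model.items():
        values[effect] = f(*(values[c] for c in causes))
    return values


def intervene(model, variable, constant):
    """Derived model whose equation for `variable` is clamped to `constant`."""
    derived = dict(model)  # the original model is left untouched
    derived[variable] = ((), lambda: constant)
    return derived


if __name__ == "__main__":
    exogenous = {"u": 1.0, "e1": 0.0, "e7": 0.1}
    print(simulate(model, exogenous)["z"])
    print(simulate(intervene(model, "s", 0), exogenous)["z"])
```

Clamping `s` removes its dependence on `q` but leaves every other equation unchanged, so the derived graph stays acyclic and the same single pass still evaluates it.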

The algebraic manipulation of the structural equation models provides a powerful language 
to describe interventions on a causal system. This is not a coincidence. Many aspects of the 
mathematical notation were invented to support causal inference in classical mechanics. We 
no longer interpret the variable values as physical quantities: the equations simply describe 
the flow of information in the causal model (Wiener, 1948). 

3.2 The Isolation Assumption 

Let us now turn our attention to the exogenous variables, that is, variables that never appear 
in the left hand side of an equation of the structural model. Leibniz's principle of sufficient 
reason claims that there are no facts without causes. This suggests that the exogenous 
variables are the effects of a network of causes not expressed by the structural equation 
model. For instance, the user intent u and the ad inventory v in figure 3 have temporal 
correlations because both users and advertisers worry about their budgets when the end of 
the month approaches. Any structural equation model should then be understood in the 
context of a larger structural equation model potentially describing all things in existence. 

Ads served on a particular page contribute to the continued satisfaction of both users 
and advertisers, and therefore have an effect on their willingness to use the services of the 
publisher in the future. The ad placement structural equation model shown in figure 2 only 
describes the causal dependencies for a single page and therefore cannot account for such 
effects. Consider however a very large structural equation model containing a copy of the 
page-level model for every web page ever served by the publisher. Figure 4 shows how we 
can thread the page-level models corresponding to pages served to the same user. Similarly 
we could model how advertisers track the performance and the cost of their advertisements 
and model how their satisfaction affects their future bids. The resulting causal graphs can 
be very complex. Part of this complexity results from time-scale differences. Thousands 
of search pages are served in a second. Each page contributes a little to the continued
satisfaction of one user and a few advertisers. The accumulation of these contributions
produces measurable effects after a few weeks.

Figure 4: Conceptually unrolling the user feedback loop by threading instances of the single
page causal graph (figure 3). Both the ad slate $s_t$ and user clicks $y_t$ have an
indirect effect on the user intent $u_{t+1}$ associated with the next query.

Many of the functional dependencies expressed by the structural equation model are left 
unspecified. Without direct knowledge of these functions, we must reason using statistical 
data. The most fundamental statistical data is collected from repeated trials that are 
assumed independent. When we consider the large structural equation model of everything,
we can only have one large trial producing a single data point. 2 It is therefore desirable to 
identify repeated patterns of identical equations that can be viewed as repeated independent 
trials. 

Therefore, when we study a structural equation model representing such a pattern, we 
need to make an isolation assumption that expresses the idea that the outcome of one trial
cannot affect the following trials. This can be achieved by assuming that the exogenous vari- 
ables are drawn from an unknown but fixed joint probability distribution. This assumption 
cuts the causation effects that could flow through the exogenous variables. 

The noise variables are also exogenous variables acting as independent sources of randomness
useful to represent the conditional distribution P(effect | causes) using the equation
effect = f(causes, $\varepsilon$). We therefore also assume joint independence between all the noise
variables and any of the named exogenous variables. 3 For instance, in the case of the ad
placement model shown in figure 2, we assume that the joint distribution of the exogenous
variables factorizes as

$$P(u, v, \varepsilon_1, \dots, \varepsilon_8) = P(u, v)\, P(\varepsilon_1) \cdots P(\varepsilon_8). \tag{3}$$

Since an isolation assumption is only true up to a point, it should be expressed clearly 
and remain under constant scrutiny. We must therefore measure additional performance 
metrics that reveal how the isolation assumption holds. For instance, the ad placement 
structural equation model and the corresponding causal graph (figures 2 and 3) do not take 
user feedback or advertiser feedback into account. Measuring the revenue is not enough 

2. See also the discussion on reinforcement learning, section 3.4. 

3. Rather than letting two noise variables display measurable statistical dependencies because they share 
a common cause, we prefer to name the common cause and make the dependency explicit in the graph.

$$
\begin{aligned}
P(u, v, x, a, b, q, s, c, y, z) ={}& P(u, v) && \text{Exogenous vars.} \\
\times{}& P(x \mid u) && \text{Query.} \\
\times{}& P(a \mid x, v) && \text{Eligible ads.} \\
\times{}& P(b \mid x, v) && \text{Bids.} \\
\times{}& P(q \mid x, a) && \text{Scores.} \\
\times{}& P(s \mid a, q, b) && \text{Ad slate.} \\
\times{}& P(c \mid a, q, b) && \text{Prices.} \\
\times{}& P(y \mid s, u) && \text{Clicks.} \\
\times{}& P(z \mid y, c) && \text{Revenue.}
\end{aligned}
$$

Figure 5: Markov factorization of the structural equation model of figure 2.




because we could easily generate revenue at the expense of the satisfaction of the users 
and advertisers. When we evaluate interventions under such an isolation assumption, we 
also need to collect a battery of additional measurements that act as proxies for the user
and advertiser satisfaction. Noteworthy examples include ad relevance estimated by human 
judges, and advertiser surplus estimated from the auctions (Varian, 2009). 

3.3 Markov Factorization 

Conceptually, we can draw a sample of the exogenous variables using the distribution spec- 
ified by the isolation assumption, and we can then generate values for all the remaining 
variables by simulating the structural equation model. 

This process defines a generative probabilistic model representing the joint distribution 
of all variables in the structural equation model. The distribution readily factorizes as the 
product of the joint probability of the named exogenous variables, and, for each equation 
in the structural equation model, the conditional probability of the effect given its direct 
causes (Pearl, 2000). As illustrated by figures 5 and 6, this Markov factorization connects the 
structural equation model that describes causation, and the Bayesian network that models 
the joint probability distribution followed by the variables under the isolation assumption.
Structural equation models and Bayesian networks appear so intimately connected that
it could be easy to forget the differences. The structural equation model is an algebraic 
object. As long as the causal graph remains acyclic, algebraic manipulations are inter- 
preted as interventions on the causal system. The Bayesian network is a statistical model 
representing a class of joint probability distributions, and, as such, does not support alge- 
braic manipulations. However its Markov factorization is an algebraic object, essentially 
equivalent to the structural equation model. 

Consider a causal system represented by a structural equation model with some unknown 
functional dependencies. We can collect statistical data during experiments involving differ- 
ent interventions on the causal system. These interventions can be represented as algebraic 
manipulations of the structural equation. Each intervention leads to a different Bayesian 
network representing the joint probability distribution of the data collected during the cor- 
responding experiment. However, the Markov factorizations of all these Bayesian networks 
share factors with the original Markov factorization. If one experiment allows us to discover 
some aspect of one of these shared factors, we can transfer this discovery into the Bayesian 
networks describing the statistical properties of other experiments. The causal modeling 
framework presented in this section is therefore a powerful transfer learning scheme char- 
acterized by a family of statistical models endowed with an algebraic structure. Such a 
scheme is sometimes called a reasoning model (Bottou, 2011). 

3.4 Special Cases 

Three special cases of causal models with increasing generality are particularly relevant. 

• In the multi-armed bandit (Robbins, 1952), a user-defined policy function $\pi$ determines
the distribution of action $a \in \{1 \dots K\}$, and an unknown reward function $r$
determines the distribution of the outcome $y$ given the action $a$ (figure 7). In order to
maximize the accumulated rewards, the player must construct policies $\pi$ that balance
the exploration of the action space with the exploitation of the best action identified
so far (e.g., Auer et al., 2002; Audibert et al., 2007; Seldin et al., 2012).

• The contextual bandit problem (Langford and Zhang, 2008) significantly increases the
complexity of multi-armed bandits by adding one exogenous variable $x$ to the policy
function $\pi$ and the reward function $r$ (figure 8).

• Both multi-armed bandit and contextual bandit are special cases of reinforcement
learning (Sutton and Barto, 1998). In essence, a Markov decision process is a sequence
of contextual bandits where the context is no longer an exogenous variable but a state
variable that depends on the previous states and actions (figure 9). Note that the
policy function $\pi$, the reward function $r$, and the transition function $s$ are independent
of time. All the time dependencies are expressed using the states $s_t$.

These special cases have increasing generality. Many simple structural equation models can 
be reduced to a contextual bandit problem using appropriate definitions of the context x, 
the action a and the outcome y. For instance, assuming that the prices c are discrete, the 
ad placement structural equation model shown in figure 2 reduces to a contextual bandit 
problem with context (u,v), actions (s,c) and reward z. Similarly, given a sufficiently
intricate definition of the state variables $s_t$, all structural equation models with discrete
variables can be reduced to a reinforcement learning problem. Such reductions lose the fine
structure of the causal graph. We show in section 5.1 how this fine structure can in fact be
leveraged to obtain more information from the same experiments.

$$
\begin{aligned}
a &= \pi(\varepsilon) && \text{Action } a \in \{1 \dots K\} \\
y &= r(a, \varepsilon') && \text{Reward } y \in \mathbb{R}
\end{aligned}
$$

Figure 7: Structural equation model for the multi-armed bandit problem. The policy $\pi$
selects a discrete action $a$, and the reward function $r$ determines the outcome $y$.
The noise variables $\varepsilon$ and $\varepsilon'$ represent independent sources of randomness useful
to model probabilistic dependencies.

$$
\begin{aligned}
a &= \pi(x, \varepsilon) && \text{Action } a \in \{1 \dots K\} \\
y &= r(x, a, \varepsilon') && \text{Reward } y \in \mathbb{R}
\end{aligned}
$$

Figure 8: Structural equation model for the contextual bandit problem. Both the action and
the reward depend on an exogenous context variable $x$.

$$
\begin{aligned}
a_t &= \pi(s_{t-1}, \varepsilon_t) && \text{Action} \\
y_t &= r(s_{t-1}, a_t, \varepsilon'_t) && \text{Reward } y_t \in \mathbb{R} \\
s_t &= s(s_{t-1}, a_t, \varepsilon''_t) && \text{Next state}
\end{aligned}
$$

Figure 9: Structural equation model for reinforcement learning. The above equations are
replicated for all $t \in \{0, \dots, T\}$. The context is now provided by a state variable
$s_{t-1}$ that depends on the previous states and actions.
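To make the contextual bandit equations of figure 8 concrete, here is a minimal Python sketch of one possible instance. Everything specific in it is an assumption chosen for illustration: the number of actions `K`, the particular `policy` and `reward` functions, and the integer context are hypothetical.

```python
import random

K = 3  # number of discrete actions (hypothetical)

def policy(x, eps):
    """a = pi(x, eps): context-preferred action with probability 0.8, else any action."""
    preferred = x % K
    if eps < 0.8:
        return preferred
    return int((eps - 0.8) / 0.2 * K) % K

def reward(x, a, eps_prime):
    """y = r(x, a, eps'): Bernoulli reward, more likely when the action matches the context."""
    p = 0.7 if a == x % K else 0.2
    return 1.0 if eps_prime < p else 0.0

def run_trial(rng):
    x = rng.randrange(10)                         # exogenous context
    eps, eps_prime = rng.random(), rng.random()   # independent noise variables
    a = policy(x, eps)
    y = reward(x, a, eps_prime)
    return x, a, y

rng = random.Random(0)
trials = [run_trial(rng) for _ in range(1000)]
average_reward = sum(y for _, _, y in trials) / len(trials)
```

Each call to `run_trial` is one independent repetition of the two structural equations, which is the kind of repeated trial the isolation assumption calls for.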

Modern reinforcement learning algorithms (see Sutton and Barto, 1998) leverage the 
assumption that the policy function, the reward function, the transition function, and the 
distributions of the corresponding noise variables, are independent from time. This property 
provides great benefits when the observed sequences of actions and rewards are long in 
comparison with the size of the state space. Only section 7 in this contribution presents 
methods that take advantage of such an invariance. The general question of leveraging 
arbitrary functional invariances in causal graphs is left for future work. 

4. Counterfactual Analysis 

We now return to the problem of formulating and answering questions about the value of 
proposed changes of a learning system. Assume for instance that we consider replacing the 
score computation model M of an ad placement engine by an alternate model M*. We seek
an answer to the conditional question:

"How will the system perform if we replace model M by model M*?"

Figure 10: Causal graph for an image recognition system. We can estimate counterfactuals
by replaying data collected in the past. [Diagram: nodes image $x$, recognized class $\hat{y}$,
label $y$, and loss $\ell$, with the intervention point marked.]



Given sufficient time and sufficient resources, we can obtain the answer using a controlled 
experiment (section 2.2). However, instead of carrying out a new experiment, we would like 
to obtain an answer using data that we have already collected in the past. 

"How would the system have performed if, when the data was collected, we had
replaced model M by model M*?"

The answer to this counterfactual question is of course a counterfactual statement that
describes the system performance subject to a condition that did not happen.

Counterfactual statements challenge ordinary logic because they depend on a condition 
that is known to be false. Although assertion $A \Rightarrow B$ is always true when assertion $A$
is false, we certainly do not mean for all counterfactual statements to be true. Lewis 
(1973) navigates this paradox using a modal logic in which a counterfactual statement 
describes the state of affairs in an alternate world that resembles ours except for the specified 
differences. Counterfactuals indeed offer many subtle ways to qualify such alternate worlds. 
For instance, we can easily describe isolation assumptions (section 3.2) in a counterfactual 
question: 

"How would the system have performed if, when the data was collected, we had
replaced model M by model M* without incurring user or advertiser reactions?"

The fact that we could not have changed the model without incurring the user and advertiser 
reactions does not matter any more than the fact that we did not replace model M by 
model M* in the first place. This does not prevent us from using counterfactual statements
to reason about causes and effects. Counterfactual questions and statements provide a
natural framework to express and share our conclusions. 

The remaining text in this section explains how we can answer certain counterfactual 
questions using data collected in the past. 

4.1 Replaying a Data Set 

Figure 10 shows the causal graph associated with a simple image recognition system. The 
classifier takes an image $x$ and produces a prospective class label $\hat{y}$. The loss measures
the penalty associated with recognizing class $\hat{y}$ while the true class is $y$. To estimate the
expected error of such a classifier, we collect a representative data set composed of labeled 
images, run the classifier on each image, and average the resulting losses. 

We can then replay the data set at will to estimate what (counterfactual) performance
would have been observed if we had used a different classifier. We can then select in
retrospect the classifier that would have worked the best and hope that it will keep working
well. This is the counterfactual viewpoint on empirical risk minimization (Vapnik, 1982).

Figure 11: Causal graph for a randomized experiment. We can estimate certain counterfactuals
by reweighting data collected in the past. [Diagram: nodes treatment $u$, patient $x$,
and outcome $y$, with the intervention and measurement points marked.]

This procedure works because both the alternate classifier and the loss function are 
known. More generally, to estimate a counterfactual by replaying a data set, we need 
to know all the functional dependencies associated with all causal paths connecting the 
intervention point to the measurement point. This is obviously not always the case. 
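The replay procedure can be sketched as follows. This is a minimal illustration, assuming a tiny hypothetical data set in which "images" are plain numbers; the names `replay_error`, `zero_one_loss`, `classifier_a`, and `classifier_b` are invented for the example.

```python
def replay_error(classifier, dataset, loss):
    """Average loss the classifier would have incurred on the logged (image, label) pairs."""
    return sum(loss(classifier(x), y) for x, y in dataset) / len(dataset)

def zero_one_loss(predicted, actual):
    """Known loss function: 1 for a misclassification, 0 otherwise."""
    return 0.0 if predicted == actual else 1.0

# Hypothetical toy data: each "image" is a number and the label is 1 when it is positive.
dataset = [(-2.0, 0), (-0.5, 0), (0.3, 1), (1.7, 1)]

def classifier_a(x):
    return 1 if x > 0.0 else 0   # classifier used when the data was collected

def classifier_b(x):
    return 1 if x > 1.0 else 0   # alternate classifier, never deployed

errors = {
    "a": replay_error(classifier_a, dataset, zero_one_loss),
    "b": replay_error(classifier_b, dataset, zero_one_loss),
}
best = min(errors, key=errors.get)  # select in retrospect
```

The replay only works here because both `classifier_b` and `zero_one_loss` are known functions, which covers every causal path from the intervention to the measurement.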

4.2 Randomized Experiments 

Figure 11 illustrates the randomized experiment suggested in section 2.3. The patients are 
randomly split into two equally sized groups receiving respectively treatments A and B. The 
overall success rate for this experiment is therefore $Y = (Y_A + Y_B)/2$ where $Y_A$ and $Y_B$ are
the success rates observed for each group. We would like to estimate which (counterfactual)
overall success rate $Y^*$ would have been observed if we had selected treatment A with
probability $p$ and treatment B with probability $1 - p$.

Since we do not know how the outcome depends on the treatment and the patient 
condition, we cannot compute which outcome y* would have been obtained if we had treated 
patient x with a different treatment u*. Therefore we cannot answer this question by
replaying the data as we did in section 4.1. 

The common cause principle (Reichenbach, 1956) formalizes a fundamental intuition 
about causation: if two events are correlated, then the first event causes the second event, 
or the second event causes the first event, or the two events have common causes. Observing 
different success rates $Y_A$ and $Y_B$ for the treatment groups reveals an empirical correlation
between the treatment u and the outcome y. Since the only cause of the treatment u was
a roll of the dice, we can reject two of the three cases enumerated by the common cause
principle. Having eliminated the possibility of a confounding common cause, we can simply
reweight the observed outcomes and compute the estimate $Y^* \approx p\,Y_A + (1 - p)\,Y_B$.
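The reweighting can also be written per sample, which is the form generalized in the next section. The sketch below is an illustration only: the record format, the function name, and the `logged_p_a` parameter (the probability of treatment A during data collection, 1/2 in the text) are assumptions.

```python
def counterfactual_success_rate(records, p, logged_p_a=0.5):
    """Estimate Y* when treatment A would have been chosen with probability p.

    records: list of (treatment, outcome) pairs with treatment in {"A", "B"}
    and outcome equal to 1 for a success and 0 otherwise.
    """
    total = 0.0
    for treatment, outcome in records:
        if treatment == "A":
            weight = p / logged_p_a
        else:
            weight = (1.0 - p) / (1.0 - logged_p_a)
        total += weight * outcome
    return total / len(records)

# Hypothetical data: two equally sized groups with Y_A = 0.75 and Y_B = 0.25.
records = [("A", 1), ("A", 1), ("A", 0), ("A", 1),
           ("B", 1), ("B", 0), ("B", 0), ("B", 0)]
estimate = counterfactual_success_rate(records, p=0.2)
```

With equally sized groups, the per-sample form coincides with $p\,Y_A + (1 - p)\,Y_B$.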

4.3 Markov Factor Replacement 

The reweighting approach discussed in section 4.2 can in fact be applied under much less 
stringent conditions. Let us return to the ad placement problem to illustrate this point. 

The average number of ad clicks per page is often called click yield. Increasing the click 
yield usually benefits both the advertiser and the publisher, whereas increasing the revenue 
per page often benefits the publisher at the expense of the advertiser. The click yield is 
therefore a very useful metric when we reason with an isolation assumption that ignores the 
advertiser reactions to pricing changes.

Figure 12: Estimating which average number of clicks per page would have been observed
if we had used a different scoring model. [Diagram: the ad placement causal graph,
including the nodes user intent $u$ and query $x$.]



Let $\omega$ be a shorthand for all variables appearing in the Markov factorization of the ad
placement structural equation model,

$$
\begin{aligned}
P(\omega) ={}& P(u,v)\,P(x \mid u)\,P(a \mid x,v)\,P(b \mid x,v)\,P(q \mid x,a) \\
&\times P(s \mid a,q,b)\,P(c \mid a,q,b)\,P(y \mid s,u)\,P(z \mid y,c).
\end{aligned}
\tag{4}
$$

Variable y was defined in section 3.1 as the set of user clicks. In the rest of the document,
we slightly abuse this notation by using the same letter y to represent the number of clicks.
We also write the expectation $Y = \mathbb{E}_{\omega \sim P(\omega)}[y]$ using the integral notation

$$Y = \int_\omega y\, P(\omega).$$



We would like to estimate what the expected click yield $Y^*$ would have been if we had
used a different scoring function (figure 12). This intervention amounts to replacing the
actual factor $P(q \mid x, a)$ by a counterfactual factor $P^*(q \mid x, a)$ in the Markov factorization:

$$
\begin{aligned}
P^*(\omega) ={}& P(u,v)\,P(x \mid u)\,P(a \mid x,v)\,P(b \mid x,v)\,P^*(q \mid x,a) \\
&\times P(s \mid a,q,b)\,P(c \mid a,q,b)\,P(y \mid s,u)\,P(z \mid y,c).
\end{aligned}
\tag{5}
$$

Let us assume, for simplicity, that the actual factor $P(q \mid x, a)$ is nonzero everywhere.
We can then estimate the counterfactual expected click yield $Y^*$ using the transformation

$$Y^* = \int_\omega y\,P^*(\omega) = \int_\omega y\,\frac{P^*(q \mid x,a)}{P(q \mid x,a)}\,P(\omega) \approx \frac{1}{n}\sum_{i=1}^{n} y_i\,\frac{P^*(q_i \mid x_i,a_i)}{P(q_i \mid x_i,a_i)}, \tag{6}$$

where the data set of tuples $(a_i, x_i, q_i, y_i)$ is distributed according to the actual Markov
factorization instead of the counterfactual Markov factorization. This data could therefore 
have been collected during the normal operation of the ad placement system. Each sample 
is reweighted to reflect its probability of occurrence under the counterfactual conditions. 
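A minimal Python sketch of the estimator (6) follows. The function name, the callables `p_actual` and `p_counterfactual`, and the toy binary score are hypothetical; the real factors $P(q \mid x, a)$ and $P^*(q \mid x, a)$ depend on the scoring models and are not specified here.

```python
def estimate_click_yield(samples, p_actual, p_counterfactual):
    """Reweighted average of the clicks y_i, following the form of equation (6).

    samples: tuples (a, x, q, y) logged under the actual scoring model.
    p_actual, p_counterfactual: callables returning P(q | x, a) and P*(q | x, a).
    """
    total = 0.0
    for a, x, q, y in samples:
        total += y * p_counterfactual(q, x, a) / p_actual(q, x, a)
    return total / len(samples)

# Hypothetical toy setting: the score q is binary and ignores (x, a).
def p_actual(q, x, a):
    return 0.5

def p_counterfactual(q, x, a):
    return 0.8 if q == 1 else 0.2

samples = [(0, 0, 1, 1), (0, 0, 1, 1), (0, 0, 0, 0), (0, 0, 0, 1)]
click_yield = sum(y for _, _, _, y in samples) / len(samples)
counterfactual_yield = estimate_click_yield(samples, p_actual, p_counterfactual)
```

When the counterfactual factor equals the actual one, every weight is 1 and the estimate reduces to the plain average of the clicks.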

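In general, we can use importance sampling to estimate the counterfactual expectation
of any quantity $\ell(\omega)$:

$$Y^* = \int_\omega \ell(\omega)\,P^*(\omega) = \int_\omega \ell(\omega)\,\frac{P^*(\omega)}{P(\omega)}\,P(\omega) \approx \frac{1}{n}\sum_{i=1}^{n} \ell(\omega_i)\,w_i \tag{7}$$

with weights

$$w_i = w(\omega_i) = \frac{P^*(\omega_i)}{P(\omega_i)} = \frac{\prod \text{factors appearing in } P^*(\omega_i) \text{ but not in } P(\omega_i)}{\prod \text{factors appearing in } P(\omega_i) \text{ but not in } P^*(\omega_i)}. \tag{8}$$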



Equation (8) emphasizes the simplifications resulting from the algebraic similarities of
the actual and counterfactual Markov factorizations. Because of these simplifications, the
evaluation of the weights only requires the knowledge of the few factors that differ between
$P(\omega)$ and $P^*(\omega)$. Each data sample needs to provide the value of $\ell(\omega_i)$ and the values of all
variables needed to evaluate the factors that do not cancel in the ratio (8).
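As a worked instance of this cancellation, consider the factorizations (4) and (5), which differ only in the scoring factor. Writing out the ratio (the elisions stand for the factors that are identical in both products) gives

$$
w(\omega) = \frac{P^*(\omega)}{P(\omega)}
= \frac{P(u,v)\,P(x \mid u) \cdots P^*(q \mid x,a) \cdots P(z \mid y,c)}{P(u,v)\,P(x \mid u) \cdots P(q \mid x,a) \cdots P(z \mid y,c)}
= \frac{P^*(q \mid x,a)}{P(q \mid x,a)} .
$$

In this case each sample only needs to record $x_i$, $a_i$, $q_i$ and the measured quantity $\ell(\omega_i)$; the user intent $u$, the ad inventory $v$, and the unknown click and pricing factors never have to be evaluated.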

In contrast, the replaying approach (section 4.1) demands the knowledge of all factors
of $P^*(\omega)$ connecting the point of intervention to the point of measurement $\ell(\omega)$. On the
other hand, it does not require the knowledge of factors appearing only in $P(\omega)$.

Importance sampling relies on the assumption that all the factors appearing in the 
denominator of the reweighting ratio (8) are nonzero whenever the factors appearing in the 
numerator are nonzero. Since these factors represent conditional probabilities resulting
from the effect of an independent noise variable in the structural equation model, this 
assumption means that the data must be collected with an experiment involving active 
randomization. We must therefore design cost-effective randomized experiments that yield 
enough information to estimate many interesting counterfactual expectations with sufficient 
accuracy. This problem cannot be solved without answering the confidence interval question: 
given data collected with a certain level of randomization, with which accuracy can we 
estimate a given counterfactual expectation? 

4.4 Confidence Intervals 

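At first sight, we can invoke the law of large numbers and write

$$\frac{1}{n}\sum_{i=1}^{n} \ell(\omega_i)\,w(\omega_i) \;\xrightarrow[n \to \infty]{}\; Y^*. \tag{9}$$

For sufficiently large n, the central limit theorem provides confidence intervals whose width
grows with the standard deviation of the product $\ell(\omega)\,w(\omega)$.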

Unfortunately, when $P(\omega)$ is small, the reweighting ratio $w(\omega)$ takes large values with
low probability. This heavy tailed distribution has annoying consequences because the 
variance of the integrand could be very high or infinite. When the variance is infinite, the 
central limit theorem does not hold. When the variance is merely very large, the central 
limit convergence might occur too slowly to justify such confidence intervals. 
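The following sketch illustrates how quickly the variance of the weights can grow. It uses the example of footnote 5 (two unit-variance normal distributions whose means differ by $m$), for which the variance of the density ratio is $e^{m^2} - 1$. The function names and the Monte Carlo sanity check are illustrative assumptions.

```python
import math
import random

def weight(x, m):
    """Ratio of the densities N(m, 1) and N(0, 1) evaluated at x."""
    return math.exp(m * x - 0.5 * m * m)

def weight_variance(m):
    """Variance of weight(x, m) when x is drawn from N(0, 1)."""
    return math.exp(m * m) - 1.0

# Sanity check: under the sampling distribution the weights average to one.
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(20000)]
mean_weight = sum(weight(x, 0.5) for x in draws) / len(draws)

# The variance explodes as the two distributions move apart.
for m in (0.5, 1.0, 2.0, 3.0):
    print(m, weight_variance(m))
```

Shifting the mean by three standard deviations already yields a variance above 8000, so a central limit confidence interval would need an impractically large sample.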

In other words, importance sampling works best when the actual distribution and the 
counterfactual distribution overlap. When the counterfactual distribution has significant 
mass in domains where the actual distribution is small, the few samples available in these 
domains receive very high weights. Their noisy contribution dominates the reweighted 
estimate. We can in fact obtain better estimates by containing the importance of the few 
samples drawn in poorly explored domains. The resulting bias can be bounded using prior 
knowledge, for instance with an assumption about the range of values taken by $\ell(\omega)$:

$$\forall \omega, \quad \ell(\omega) \in [0, M]. \tag{10}$$

We control the importance of these noise-inducing samples by replacing the importance
sampling weights $w(\omega)$ by capped weights $\bar{w}(\omega)$. Let $R$ be the maximum weight value
deemed acceptable, and $\mathbb{1}\{c\}$ be the indicator function, which is equal to 1 when condition
$c$ is true and 0 otherwise. Zero-capped weights

$$\bar{w}(\omega) = w(\omega)\,\mathbb{1}\{w(\omega) < R\}$$

eliminate the contribution of poorly explored domains, whereas max-capped weights

$$\bar{w}(\omega) = \min\{w(\omega), R\}$$

merely contain their importance. In practice, we have obtained very consistent results using
zero-capped weights and choosing $R$ equal to the fifth largest reweighting ratio observed on
the empirical data. 4
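The two capping rules and the heuristic choice of $R$ can be sketched as follows. The function names and the toy list of weights are hypothetical.

```python
def zero_cap(w, R):
    """Zero-capped weight: samples whose weight reaches R contribute nothing."""
    return w if w < R else 0.0

def max_cap(w, R):
    """Max-capped weight: the weight is clipped at R."""
    return min(w, R)

def fifth_largest(weights):
    """Heuristic from the text: R is the fifth largest observed reweighting ratio."""
    return sorted(weights, reverse=True)[4]

weights = [0.5, 1.0, 2.0, 3.0, 10.0, 50.0, 4.0, 0.1]  # toy reweighting ratios
R = fifth_largest(weights)
zero_capped = [zero_cap(w, R) for w in weights]
max_capped = [max_cap(w, R) for w in weights]
```

With zero-capped weights and this choice of $R$, exactly the five largest ratios are discarded when the observed ratios are distinct.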

The capped expectation

$$\bar{Y}^* = \int_\omega \ell(\omega)\,\bar{w}(\omega)\,P(\omega) \approx \hat{Y}^* = \frac{1}{n}\sum_{i=1}^{n} \ell(\omega_i)\,\bar{w}(\omega_i) \tag{11}$$

is much easier to estimate than (9) because the magnitude of the capped weights $\bar{w}(\omega)$ is
bounded by $R$. The estimation error $Y^* - \hat{Y}^*$ can then be split into two components $\bar{Y}^* - \hat{Y}^*$
and $Y^* - \bar{Y}^*$ that can be bounded using confidence intervals that we call outer confidence
intervals and inner confidence intervals.

Since the capped weights are bounded, outer confidence intervals are easily obtained
using either the central limit theorem or empirical Bernstein bounds (see appendix A.2 for
details). They have the form

$$\mathbb{P}\{\, \hat{Y}^* - \epsilon_R \le \bar{Y}^* \le \hat{Y}^* + \epsilon_R \,\} \ge 1 - \delta, \tag{12}$$

where we use symbol $\mathbb{P}$ instead of P to emphasize that this probability is not directly
associated with the Markov factorization but represents the distribution of random draws
of the sample $\omega_1 \dots \omega_n$.

The difference $Y^* - \bar{Y}^*$ represents the contribution of the poorly explored domains to the
expectation $Y^*$. Since the samples do not provide reliable information about such domains,
we bound this difference using assumption (10):

$$0 \le Y^* - \bar{Y}^* = \int_\omega [w(\omega) - \bar{w}(\omega)]\,\ell(\omega)\,P(\omega) \le M \int_\omega [w(\omega) - \bar{w}(\omega)]\,P(\omega). \tag{13}$$

In order to estimate these bounds using the empirical data, observe that

$$\int_\omega [w(\omega) - \bar{w}(\omega)]\,P(\omega) = \int_\omega P^*(\omega) - \int_\omega \bar{w}(\omega)\,P(\omega) = 1 - \bar{W}^*,$$

where the quantity

$$\bar{W}^* = \int_\omega \bar{w}(\omega)\,P(\omega) \approx \hat{W}^* = \frac{1}{n}\sum_{i=1}^{n} \bar{w}(\omega_i)$$

is easy to estimate because the magnitude of the capped weight $\bar{w}(\omega)$ is conveniently
bounded. We can then rewrite inequality (13) as

$$0 \le Y^* - \bar{Y}^* \le M\,(1 - \bar{W}^*). \tag{14}$$

The standard techniques then yield an inner confidence interval of the form

$$\mathbb{P}\{\, \bar{Y}^* \le Y^* \le \bar{Y}^* + M\,(1 - \hat{W}^* + \xi_R) \,\} \ge 1 - \delta. \tag{15}$$

Putting (12) and (15) together yields our final confidence interval:

$$\mathbb{P}\{\, \hat{Y}^* - \epsilon_R \le Y^* \le \hat{Y}^* + M\,(1 - \hat{W}^* + \xi_R) + \epsilon_R \,\} \ge 1 - 2\delta. \tag{16}$$

4. This is a slight abuse because the theory calls for choosing R before seeing the data.

Replacing the unbiased importance sampling estimator (9) by a capped importance
sampling estimator (11) therefore leads to improved confidence intervals. This replacement
also eliminates the need for the assumption $\forall \omega,\ P(\omega) > 0$ because we can define $\bar{w}(\omega)$ by
continuity whenever $P(\omega) = 0$. The only difference is the need to consider the two cases
$P(\omega) = 0$ and $P(\omega) > 0$ when deriving inequality (14).
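The combined interval (16) can be assembled as in the sketch below. It is a rough illustration, not the procedure of appendix A.2: the half-widths $\epsilon_R$ and $\xi_R$ are computed here with a plain normal approximation, which is an assumption, and the function name and arguments are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def capped_confidence_interval(losses, capped_weights, M, delta=0.025):
    """Lower and upper bounds on Y* with the shape of interval (16).

    losses: values l(w_i), assumed to lie in [0, M].
    capped_weights: capped weights for the same samples.
    Assumption: eps_R and xi_R come from a central limit approximation.
    """
    n = len(losses)
    products = [l * w for l, w in zip(losses, capped_weights)]
    y_hat = mean(products)            # empirical capped expectation
    w_hat = mean(capped_weights)      # empirical average capped weight
    eps_r = NormalDist().inv_cdf(1.0 - delta / 2.0) * stdev(products) / math.sqrt(n)
    xi_r = NormalDist().inv_cdf(1.0 - delta) * stdev(capped_weights) / math.sqrt(n)
    lower = y_hat - eps_r
    upper = y_hat + M * (1.0 - w_hat + xi_r) + eps_r
    return lower, upper

# Toy data: one of four samples has been zero-capped.
lower, upper = capped_confidence_interval([1.0, 1.0, 0.0, 1.0], [0.0, 1.0, 1.0, 1.0], M=1.0)
```

The interval is asymmetric by construction: capping can only remove mass from the estimate, so the correction term is added to the upper bound only.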

4.5 Interpreting the Confidence Intervals 

The estimation of the counterfactual expectation Y* can be inaccurate because the sample 
size is insufficient, or because the sampling distribution P(w) does not sufficiently explore 
the counterfactual conditions of interest. We argue in this section that the relative sizes 
of the outer and inner confidence intervals provide precious cues to determine whether we 
can continue collecting data using the same experimental setup or should adjust the data 
collection experiment in order to obtain a better coverage. 

Let the set $\Omega_R$ contain all the $\omega$ excluded from the computation of the zero-capped
expectation $\bar{Y}^*$, that is, $\Omega_R = \{\omega : P^*(\omega) \ge R\,P(\omega)\}$. The size of the inner confidence
interval always exceeds the gap $G_R$ separating the upper and lower bound of inequality (14),

$$G_R = M\,(1 - \bar{W}^*) = M \int_{\omega \in \Omega_R} w(\omega)\,P(\omega).$$

Since the gap is a positive decreasing function of $R$,

$$G_R \ge G_\infty = M \int_{\omega \in \Omega_\infty} w(\omega)\,P(\omega) \ge 0,$$

where $\Omega_\infty = \bigcap_{R>0} \Omega_R = \{\omega : P^*(\omega) > 0 \text{ and } P(\omega) = 0\}$. Therefore the inner confidence
interval size always exceeds the quantity $G_\infty$, which represents the impact of unexplored
domains on the counterfactual expectation $Y^*$. Since increasing the sample size cannot probe
domain $\Omega_\infty$, the only way to reduce the inner confidence interval below the minimum gap is to
collect data using a different distribution.
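The dependence of the gap on the capping bound can be explored with a plug-in estimate. In the sketch below, the function name and the toy weights are assumptions, and the quantity computed is the empirical counterpart $M(1 - \hat{W}^*)$ rather than the exact gap $G_R$.

```python
def estimated_gap(weights, R, M):
    """Plug-in estimate M * (1 - W_hat) of the gap for zero-capped weights."""
    capped = [w if w < R else 0.0 for w in weights]
    return M * (1.0 - sum(capped) / len(capped))

weights = [0.2, 0.5, 0.8, 1.0, 1.5, 2.0]   # toy reweighting ratios with mean 1
gaps = [estimated_gap(weights, R, M=1.0) for R in (1.0, 2.0, 5.0)]
```

Raising $R$ keeps more samples and can only shrink the estimated gap, at the price of larger variances for the capped quantities.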

In order to obtain a good confidence interval (16), we ideally would like to select a
capping bound $R$ that achieves a good compromise between the non-increasing gap $G_R$ and
the uncertainties $\xi_R$ and $\epsilon_R$ resulting from the sample size $n$ and from the non-decreasing
variances $\mathrm{var}[\bar{w}(\omega)]$ and $\mathrm{var}[\bar{w}(\omega)\,\ell(\omega)]$.
This is relatively easy when the weights (8) can be expressed as the ratio of discrete
distributions defined on domains with relatively small cardinality. The gap and the variances
are then piecewise constant functions of $R$. A single well chosen value of $R$ can usually
handle a broad range of realistic sample sizes. The inner confidence interval then measures
the uncertainty associated with the insufficiently explored domain $\Omega_R \supset \Omega_\infty$.

The situation is far more complex with continuous distributions. On one hand, we can
select distributions $P(\omega)$ that cover large domains and therefore ensure that $G_\infty$ is zero or
close to zero. On the other hand, since the density integrates to one, such
distributions tend to have small densities, leading to potentially very large weights $w(\omega)$ and
therefore extremely large variances 5 that cannot be countered with realistically large sample
sizes. Furthermore, when $R$ reaches large values, the variances $\mathrm{var}[\bar{w}(\omega)]$ and $\mathrm{var}[\bar{w}(\omega)\,\ell(\omega)]$
are even harder to estimate than the capped expectation $\bar{Y}^*$. This means in practice that
we must conservatively pick a value of $R$ that is well below the theoretically optimal choice.
With such a setup, the inner confidence interval is an approximation of the gap $G_R$ which
measures the uncertainty associated with the insufficiently explored domain $\Omega_R$.

In both cases, assuming that the capping bound R has been chosen competently, the 
two components of the zero-capped confidence intervals (16) provide precious guidelines: 

• The inner confidence interval (15) witnesses the uncertainty associated with the domain
$\Omega_R$ insufficiently explored by the actual distribution. A large inner confidence
interval suggests that the most practical way to improve the estimate is to adjust the 
data collection experiment in order to obtain a better coverage of the counterfactual 
conditions of interest. 

• The outer confidence interval (12) represents the uncertainty that results from the 
limited sample size. A large outer confidence interval indicates that the sample is too 
small. To improve the result, we simply need to continue collecting data using the 
same experimental setup. 

This interpretation is less obvious in the case of max-capped weights. It nevertheless remains 
useful because both capping methods usually produce comparable numerical results. 

4.6 Experimenting with Mainline Reserves 

We return to the ad placement problem to illustrate the reweighting approach and the 
interpretation of the confidence intervals. Manipulating the reserves $R_p(x)$ associated with
the mainline positions (figure 1) controls which ads are prominently displayed in the mainline 
or displaced into the sidebar. 

We seek in this section to answer counterfactual questions of the form:

"How would the ad placement system have performed if we had scaled the mainline
reserves by a constant factor $\rho$, without incurring user or advertiser reactions?"

Randomization was introduced using a modified version of the ad placement engine.
Before determining the ad layout (see section 2.1), a random number $\varepsilon$ is drawn according
to the standard normal distribution $\mathcal{N}(0,1)$, and all the mainline reserves are multiplied
by $m = \rho\,e^{-\sigma^2/2 + \sigma\varepsilon}$. Such multipliers follow a log-normal distribution 6 whose mean is $\rho$
and whose width is controlled by $\sigma$. This effectively provides a parametrization of the
conditional score distribution $P(q \mid x, a)$ (see figure 5).

5. Consider for instance two normal distributions with unit variance and means respectively equal to zero
and $m$. The ratio of their densities follows a log-normal distribution of variance $e^{m^2} - 1$.

The Bing search platform offers many ways to select traffic for controlled experiments 
(section 2.2). In order to match our isolation assumption, individual page views were 
randomly assigned to traffic buckets without regard to the user identity. The main treatment 
bucket was processed with mainline reserves randomized by a multiplier drawn as explained 
above with $\rho = 1$ and $\sigma = 0.3$. With these parameters, the mean multiplier is exactly 1,
and 95% of the multipliers are in range [0.52,1.74]. Samples describing 22 million search 
result pages were collected during five consecutive weeks. 
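The multiplier distribution is easy to reproduce numerically. The sketch below draws multipliers with $\rho = 1$ and $\sigma = 0.3$ and recomputes the quoted range, assuming that range corresponds to noise values $\varepsilon = \pm 2$ (an assumption; the text does not say how the 95% range was computed). The function names are hypothetical.

```python
import math
import random

def draw_multiplier(rng, rho, sigma):
    """m = rho * exp(-sigma^2 / 2 + sigma * eps) with eps drawn from N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return rho * math.exp(-0.5 * sigma ** 2 + sigma * eps)

def two_sigma_range(rho, sigma):
    """Multipliers obtained for eps = -2 and eps = +2."""
    center = math.log(rho) - 0.5 * sigma ** 2
    return math.exp(center - 2.0 * sigma), math.exp(center + 2.0 * sigma)

rng = random.Random(0)
multipliers = [draw_multiplier(rng, 1.0, 0.3) for _ in range(20000)]
mean_multiplier = sum(multipliers) / len(multipliers)   # should be close to rho
low, high = two_sigma_range(1.0, 0.3)
```

The term $-\sigma^2/2$ in the exponent is what keeps the mean multiplier equal to $\rho$ regardless of the amount of randomization.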

We then use this data to estimate what would have been measured if the mainline reserve
multipliers had been drawn according to a distribution determined by parameters $\rho^*$ and $\sigma^*$.
This is achieved by reweighting each sample $\omega_i$ with

$$w(\omega_i) = \frac{P^*(q_i \mid x_i, a_i)}{P(q_i \mid x_i, a_i)} = \frac{p(m_i;\, \rho^*, \sigma^*)}{p(m_i;\, \rho, \sigma)},$$

where $m_i$ is the multiplier drawn for this sample during the data collection experiment,
and $p(t;\, \rho, \sigma)$ is the density of the log-normal multiplier distribution.
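A sketch of this weight computation is given below. It assumes the log-normal parametrization implied by the multiplier formula, namely that the logarithm of the multiplier is normal with mean $\log\rho - \sigma^2/2$ and standard deviation $\sigma$; the function names and default arguments are hypothetical.

```python
import math

def lognormal_density(m, rho, sigma):
    """Density at m of the multiplier rho * exp(-sigma^2 / 2 + sigma * eps), eps ~ N(0, 1)."""
    mu = math.log(rho) - 0.5 * sigma ** 2
    z = (math.log(m) - mu) / sigma
    return math.exp(-0.5 * z * z) / (m * sigma * math.sqrt(2.0 * math.pi))

def reserve_weight(m, rho_star, sigma_star, rho=1.0, sigma=0.3):
    """Weight of a sample whose logged multiplier is m."""
    return lognormal_density(m, rho_star, sigma_star) / lognormal_density(m, rho, sigma)

# Samples logged with large multipliers gain weight when rho* exceeds rho.
high_sample = reserve_weight(1.5, rho_star=1.2, sigma_star=0.3)
low_sample = reserve_weight(0.7, rho_star=1.2, sigma_star=0.3)
```

Only the logged multiplier is needed to evaluate the weight, which is why the randomization can be recorded cheaply alongside the usual metrics.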

Figure 13 reports results obtained by varying $\rho^*$ while keeping $\sigma^* = \sigma$. This amounts
to estimating what would have been measured if all mainline reserves had been multiplied
by $\rho^*$ while keeping the same randomization. The curves bound 95% confidence intervals
on the variations of the average number of mainline ads displayed per page, the average
number of ad clicks per page, and the average revenue per page, as functions of $\rho^*$. The
inner confidence intervals, represented by the filled areas, grow sharply when $\rho^*$ leaves the
range explored during the data collection experiment. The average revenue per page has
more variance because a few very competitive queries command high prices. 

In order to validate the accuracy of these counterfactual estimates, a second traffic 
bucket of equal size was configured with mainline reserves reduced by about 18%. The 
green hollow circles in figure 13 represent the metrics effectively measured on this bucket 
during the same time period. The effective measurements and the counterfactual estimates 
match with high accuracy. 

Finally, in order to measure the cost of the randomization, we also ran the unmodified 
ad placement system on a control bucket. The brown filled circles in figure 13 represent 
the metrics effectively measured on the control bucket during the same time period. The 
randomization caused a small but statistically significant increase of the number of mainline 
ads per page. The click yield and average revenue differences are not significant. 

This experiment shows that we can obtain accurate counterfactual estimates with 
affordable randomization strategies. However, this nice conclusion does not capture the true 
practical value of the counterfactual estimation approach. 

6. More precisely, $\ln\mathcal{N}(\mu, \sigma^2)$ with $\mu = -\sigma^2/2 + \log\rho$. 

Figure 13: Estimated variations of three performance metrics in response to mainline 
reserve changes. The curves delimit 95% confidence intervals for the metrics we 
would have observed if we had increased the mainline reserves by the percentage 
shown on the horizontal axis. The filled areas represent the inner confidence 
intervals. The hollow squares represent the metrics measured on the experimental 
data. The hollow circles represent metrics measured on a second experimental 
bucket with mainline reserves reduced by 18%. The filled circles represent the 
metrics effectively measured on a control bucket running without randomization. 

4.7 More on Mainline Reserves 

The main benefit of the counterfactual estimation approach is the ability to use the same 
data to answer a broad range of counterfactual questions. Here are a few examples of 
counterfactual questions that can be answered using data collected using the simple mainline 
reserve randomization scheme described in the previous section: 

• Different variances - Instead of estimating what would have been measured if we 
had increased the mainline reserves without changing the randomization variance, 
that is, letting $\sigma^* = \sigma$, we can use the same data to estimate what would have been 
measured if we had also changed $\sigma$. This provides the means to determine which level 
of randomization we can afford in future experiments. 

• Point-wise estimates - We often want to estimate what would have been measured if 
we had set the mainline reserves to a specific value without randomization. Although 
computing estimates for small values of $\sigma$ often works well enough, very small values 
lead to large confidence intervals. 

Let $Y_\nu(\rho)$ represent the expectation we would have observed if the multipliers $m$ had 
mean $\rho$ and variance $\nu$. We have then $Y_\nu(\rho) = E_m[\,E[y \mid m]\,] = E_m[Y_0(m)]$. Assuming 
that the point-wise value $Y_0$ is smooth enough for a second order development, 

$$Y_\nu(\rho) \approx E_m\big[\, Y_0(\rho) + (m-\rho)\, Y_0'(\rho) + (m-\rho)^2\, Y_0''(\rho)/2 \,\big] = Y_0(\rho) + \nu\, Y_0''(\rho)/2 .$$

Although the reweighting method cannot estimate the point-wise value $Y_0(\rho)$ directly, 
we can use the reweighting method to estimate both $Y_\nu(\rho)$ and $Y_{2\nu}(\rho)$ with acceptable 
confidence intervals and write $Y_0(\rho) \approx 2\,Y_\nu(\rho) - Y_{2\nu}(\rho)$ (Goodwin, 2011). 

• Query-dependent reserves - Compare for instance the queries "car insurance" and 
"common cause principle" in a web search engine. Since the advertisement potential 
of a search varies considerably with the query, it makes sense to investigate various 
ways to define query-dependent reserves (Charles and Chickering, 2012). 

The data collected using the simple mainline reserve randomization can also be used to 
estimate what would have been measured if we had increased all the mainline reserves 
by a query-dependent multiplier $\rho^*(x)$. This is simply achieved by reweighting each 
sample $\omega_i$ with 

$$w_i = \frac{P^*(q_i \mid x_i, a_i)}{P(q_i \mid x_i, a_i)} = \frac{p(m_i;\, \rho^*(x_i), \sigma)}{p(m_i;\, \rho, \sigma)} .$$
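
The extrapolation used for the point-wise estimates in the list above can be checked with a short worked derivation. It only assumes the second order development stated there, applied once with variance $\nu$ and once with variance $2\nu$; the remainder terms are neglected in both cases.

```latex
\begin{align*}
Y_{\nu}(\rho)  &\approx Y_0(\rho) + \tfrac{\nu}{2}\, Y_0''(\rho), \\
Y_{2\nu}(\rho) &\approx Y_0(\rho) + \nu\, Y_0''(\rho), \\
2\,Y_{\nu}(\rho) - Y_{2\nu}(\rho)
  &\approx 2\,Y_0(\rho) + \nu\, Y_0''(\rho) - Y_0(\rho) - \nu\, Y_0''(\rho)
   = Y_0(\rho).
\end{align*}
```

The curvature term cancels, so the combination removes the leading bias caused by the randomization while relying only on quantities that the reweighting method can estimate.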

Considerably broader ranges of counterfactual questions can be answered when data 
is collected using randomization schemes that explore more dimensions. For instance, in 
the case of the ad placement problem, we could apply an independent random multiplier 
for each score instead of applying a single random multiplier to the mainline reserves only. 
However, the more dimensions we randomize, the more difficult it is to collect data that 
effectively explores all these dimensions. The next section presents a portfolio of methods 
to work around this problem. 

Figure 14: The reweighting variable(s) must intercept all causal paths from the point of 
intervention to the point of measurement. 



5. Importance Sampling Toolbox 

This section describes techniques that improve our ability to give answers to counterfactual 
questions. Leveraging the structure of the causal graph provides opportunities to improve 
the inner confidence intervals. Exploiting an invariant prediction function can improve outer 
confidence intervals when comparing the expectations of the same variable under two different 
counterfactual distributions. Computing the derivatives of a counterfactual expectation 
with respect to parameters describing the intervention provides directional answers. 



5.1 Better Reweighting Variables 

Many search result pages come without eligible ads. Regardless of the reserves, we know 
with certainty that such pages will have zero mainline ads, receive zero clicks, and generate 
zero revenue. The results shown in figure 13 were in fact helped by a little optimization: 
when a sample $\omega_i$ describes a page without ads, the weights $w(\omega_i)$ are forced to 1. This 
does not change the estimate since these weights are multiplied by zero, but this reduces 
the variance of the weights and therefore improves the inner confidence intervals. 
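
The effect of this optimization can be sketched on simulated data. Everything in the example is a hypothetical toy setup: the share of pages with eligible ads (40%), the click model, and the target $\rho^* = 1.2$ are assumptions chosen for illustration. The weight uses the closed form of the log-normal density ratio when $\sigma^* = \sigma$ and $\rho = 1$.

```python
import math
import random
import statistics

SIGMA, RHO_STAR = 0.3, 1.2   # data collected with rho = 1; hypothetical target rho*

def weight(eps):
    """Log-normal density ratio for a multiplier drawn with noise eps."""
    a = math.log(RHO_STAR) / SIGMA
    return math.exp(a * eps - 0.5 * a * a)

rng = random.Random(1)  # arbitrary seed
pages = []
for _ in range(50_000):
    eps = rng.gauss(0.0, 1.0)
    has_ads = rng.random() < 0.4                             # hypothetical share of pages with ads
    clicks = 1.0 if has_ads and rng.random() < 0.1 else 0.0  # toy click model
    pages.append((eps, has_ads, clicks))

plain = [weight(eps) for eps, _, _ in pages]
forced = [weight(eps) if has_ads else 1.0 for eps, has_ads, _ in pages]

# Pages without ads contribute zero clicks, so both estimates coincide.
estimate_plain = statistics.fmean(w * y for w, (_, _, y) in zip(plain, pages))
estimate_forced = statistics.fmean(w * y for w, (_, _, y) in zip(forced, pages))

var_plain = statistics.pvariance(plain)
var_forced = statistics.pvariance(forced)
print(estimate_plain, estimate_forced, var_plain, var_forced)
```

The estimate is unchanged while the variance of the weights drops roughly in proportion to the share of pages without ads.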

Such prior knowledge is in fact encoded in the structure of the causal graph. For instance 
we know that users make click decisions without knowing which scores were computed by 
the ad placement engine, and without knowing the prices charged to advertisers. The ad 
placement causal graph encodes this knowledge by showing the clicks y as direct effects of 
the user intent u and the ad slate s. This implies that the exact value of the scores q does 
not matter to the clicks y as long as the ad slate s remains the same (see figure 14). 

Because the causal graph has this special structure, we can simplify both the actual 
and counterfactual Markov factorizations (4) (5) without eliminating the variable y whose 
expectation is sought. Successively eliminating variables z, c, and q gives: 

$$P(u, v, x, a, b, s, y) = P(u, v)\, P(x \mid u)\, P(a \mid x, v)\, P(b \mid x, v)\, P(s \mid x, a, b)\, P(y \mid s, u) ,$$
$$P^*(u, v, x, a, b, s, y) = P(u, v)\, P(x \mid u)\, P(a \mid x, v)\, P(b \mid x, v)\, P^*(s \mid x, a, b)\, P(y \mid s, u) .$$

The conditional distributions $P(s \mid x, a, b)$ and $P^*(s \mid x, a, b)$ did not originally appear in 
the Markov factorization. They are defined by marginalization as a consequence of the 
elimination of the variable $q$ representing the scores. 

$$P(s \mid x, a, b) = \int_q P(s \mid a, q, b)\, P(q \mid x, a) , \qquad P^*(s \mid x, a, b) = \int_q P(s \mid a, q, b)\, P^*(q \mid x, a) .$$

We can estimate the counterfactual click yield Y* using these simplified factorizations: 

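$$Y^* = \int y\, P^*(u, v, x, a, b, s, y) = \int y\, \frac{P^*(s \mid x, a, b)}{P(s \mid x, a, b)}\, P(u, v, x, a, b, s, y) \approx \frac{1}{n} \sum_{i=1}^{n} y_i\, \frac{P^*(s_i \mid x_i, a_i, b_i)}{P(s_i \mid x_i, a_i, b_i)} . \qquad (17)$$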

Comparing (6) and (17) makes the difference very clear: instead of computing the ratio 
of the probabilities of the observed scores under the counterfactual and actual distributions, 
we compute the ratio of the probabilities of the observed ad slates under the counterfactual 
and actual distributions. As illustrated by figure 14, we now distinguish the reweighting 
variable from the intervention. 

In general, the algebraic manipulation described above is possible whenever the reweighting 
variable (or variables) intercepts all the causal paths connecting the point of intervention 
to the measurement point. The numerator and the denominator of the reweighting ratio 
are then computed by collapsing all the factors connecting the intervention point to the 
intercepting reweighting variable along each of these causal paths. In order to be able to 
evaluate the weights, these factors must of course be known. We can only improve the 
confidence intervals by using reweighting variables that are closer to the measurement point 
and respect these constraints. 

We have reproduced the experiments described in section 4.6 with the counterfactual 
estimate (17) instead of (6). For each example $\omega_i$, we determine which range $[m_i^{\min}, m_i^{\max}]$ 
of mainline reserve multipliers could have produced the observed ad slate $s_i$, and then 
compute the reweighting ratio using the formula: 

$$w_i = \frac{P^*(s_i \mid x_i, a_i, b_i)}{P(s_i \mid x_i, a_i, b_i)} = \frac{\Psi(m_i^{\max};\, \rho^*, \sigma^*) - \Psi(m_i^{\min};\, \rho^*, \sigma^*)}{\Psi(m_i^{\max};\, \rho, \sigma) - \Psi(m_i^{\min};\, \rho, \sigma)} ,$$

where $\Psi(m;\, \rho, \sigma)$ is the cumulative of the log-normal multiplier distribution. 
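
A minimal sketch of this slate-based weight follows. It assumes that the interval of multipliers compatible with the observed slate has already been determined by replaying the auction, which is not shown; the function names and the example intervals are hypothetical. The cumulative distribution is written with `math.erf` from the Python standard library.

```python
import math

def lognormal_cdf(m, rho, sigma):
    """Cumulative Psi(m; rho, sigma) of m = rho * exp(-sigma^2/2 + sigma*eps)."""
    if m <= 0.0:
        return 0.0
    if math.isinf(m):
        return 1.0
    mu = math.log(rho) - 0.5 * sigma ** 2
    return 0.5 * (1.0 + math.erf((math.log(m) - mu) / (sigma * math.sqrt(2.0))))

def slate_weight(m_min, m_max, rho_star, sigma_star, rho=1.0, sigma=0.3):
    """Ratio of the probabilities of observing the slate under P* and P."""
    num = lognormal_cdf(m_max, rho_star, sigma_star) - lognormal_cdf(m_min, rho_star, sigma_star)
    den = lognormal_cdf(m_max, rho, sigma) - lognormal_cdf(m_min, rho, sigma)
    return num / den

# A page without eligible ads shows the same empty slate for every multiplier.
w_no_ads = slate_weight(0.0, math.inf, rho_star=1.2, sigma_star=0.3)
# Hypothetical slate displayed whenever the multiplier is below 0.9.
w_low = slate_weight(0.0, 0.9, rho_star=1.2, sigma_star=0.3)
print(w_no_ads, w_low)
```

The first case recovers the weight of exactly 1 for pages without ads. More generally, wide intervals yield weights close to 1, which is consistent with the narrower inner confidence intervals reported for this estimator.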

Figure 15 shows counterfactual estimates obtained using the same data as figure 13. 
The obvious improvement of the inner confidence intervals significantly extends the range of 
mainline reserve multipliers for which we can compute accurate counterfactual expectations 
using this same data. The figure does not report the average revenue per page because 
the revenue z also depends on the scores q through the click prices c. This causal path is 
not intercepted by the ad slate variable s alone. Although reweighting according to both 
the ad slate s and the click prices c does not bring much benefit, we have obtained nice 
counterfactual revenue estimates by reweighting according to both the ad slate s and a new 
variable representing solely the click prices of those ads that actually received clicks. 

Figure 16 shows how this approach can be extended to the randomization of all the scores 
using a collection of independent log-normal multipliers. The weights are then computed as 
the ratio of the probabilities of the observed ad slate under the counterfactual and actual 
multiplier distributions. Details will be provided in a forthcoming publication. 

Figure 15: Estimated variations of two performance metrics in response to mainline reserve 
changes. These estimates were obtained using the ad slates s as reweighting 
variable. Compare the inner confidence intervals with those shown in figure 13. 

Figure 16: A distribution on the scores q induces a distribution on the possible ad slates s. 
If the observed slate is slate2, the reweighting ratio is 34/22. 

5.2 Counterfactual Differences 

The designer of a learning system often has to evaluate which of two interventions is most 
likely to improve the system performance. This can be achieved by estimating the 
difference $Y^+ - Y^*$ of expectations of the same quantity $\ell(\omega)$ under two different counterfactual 
distributions $P^+(\omega)$ and $P^*(\omega)$. 

These expectations are often affected by variables whose value is left unchanged by the 
interventions under consideration. For instance, seasonal effects can have very large effects 
on the number of ad clicks. We can then expect that these variables affect both $Y^+$ and 
$Y^*$ in similar ways. This provides an opportunity to obtain substantially better confidence 
intervals for the difference $Y^+ - Y^*$. 

In addition to the notation $\omega$ representing all the variables in the structural equation 
model, we use notation $\upsilon$ to represent all the variables that are not direct or indirect effects 
of variables affected by the interventions under consideration. 

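Let $\zeta(\upsilon)$ be a known function believed to be a good predictor of the quantity $\ell(\omega)$ whose 
counterfactual expectation is sought. Since $P^*(\upsilon) = P(\upsilon)$, the following equality holds 
regardless of the quality of this prediction: 

$$Y^* = \int_\omega \ell(\omega)\, P^*(\omega) = \int_\upsilon \zeta(\upsilon)\, P^*(\upsilon) + \int_\omega [\ell(\omega) - \zeta(\upsilon)]\, P^*(\omega)$$
$$\qquad = \int_\upsilon \zeta(\upsilon)\, P(\upsilon) + \int_\omega [\ell(\omega) - \zeta(\upsilon)]\, w(\omega)\, P(\omega) . \qquad (18)$$

Decomposing both $Y^+$ and $Y^*$ in this way and computing the difference, 

$$Y^+ - Y^* = \int_\omega [\ell(\omega) - \zeta(\upsilon)]\, \Delta w(\omega)\, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} [\ell(\omega_i) - \zeta(\upsilon_i)]\, \Delta w(\omega_i) ,$$
$$\text{with} \quad \Delta w(\omega) = \frac{P^+(\omega)}{P(\omega)} - \frac{P^*(\omega)}{P(\omega)} = \frac{P^+(\omega) - P^*(\omega)}{P(\omega)} . \qquad (19)$$

The construction of confidence intervals for such an estimator of the difference $Y^+ - Y^*$ 
demands additional bookkeeping because both the weights $\Delta w(\omega_i)$ and the integrand 
$\ell(\omega) - \zeta(\upsilon)$ can now be positive or negative. Details are provided in appendix A.3.2. 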

The outer confidence interval size is reduced if the variance of the residual $\ell(\omega) - \zeta(\upsilon)$ 
is smaller than the variance of the original variable $\ell(\omega)$. For instance, a suitable predictor 
function $\zeta(\upsilon)$ can significantly capture the seasonal click yield variations regardless of the 
interventions under consideration. Even a constant predictor function can considerably 
change the variance of the outer confidence interval. Therefore, in the absence of a better 
predictor, we still can (and always should) center the integrand using a constant predictor. 
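
The benefit of centering can be sketched on a toy problem. The simulated outcome model, the seasonal variable, and the two interventions ($\rho^+ = 1.1$ and $\rho^* = 0.9$) are hypothetical assumptions made for the example. In this toy model the true difference is $-0.4$, and the three estimators differ only by the predictor $\zeta$ subtracted from the outcome. Note that the constant predictor is fitted on the same sample, which is a simplification.

```python
import math
import random
import statistics

SIGMA = 0.3  # data collected with rho = 1, as in section 4.6

def weight(eps, rho_star):
    """Log-normal density ratio for a multiplier drawn with noise eps."""
    a = math.log(rho_star) / SIGMA
    return math.exp(a * eps - 0.5 * a * a)

rng = random.Random(2)  # arbitrary seed
samples = []
for _ in range(50_000):
    season = rng.gauss(10.0, 3.0)   # exogenous seasonal level, unaffected by the reserves
    eps = rng.gauss(0.0, 1.0)
    m = math.exp(-0.5 * SIGMA ** 2 + SIGMA * eps)
    clicks = season - 2.0 * m + rng.gauss(0.0, 0.5)   # toy outcome model
    samples.append((season, eps, clicks))

def difference_terms(predictor):
    """Per-sample terms [l - zeta(v)] * delta_w for rho+ = 1.1 and rho* = 0.9."""
    return [(y - predictor(season)) * (weight(eps, 1.1) - weight(eps, 0.9))
            for season, eps, y in samples]

mean_clicks = statistics.fmean(y for _, _, y in samples)
terms = {
    "no predictor": difference_terms(lambda season: 0.0),
    "constant predictor": difference_terms(lambda season: mean_clicks),
    "seasonal predictor": difference_terms(lambda season: season),
}
estimates = {name: statistics.fmean(t) for name, t in terms.items()}
variances = {name: statistics.pvariance(t) for name, t in terms.items()}
for name in terms:
    print(f"{name}: estimate {estimates[name]:+.3f}, variance {variances[name]:.2f}")
```

All three estimators target the same difference, but the variance of the averaged terms, and therefore the outer confidence interval, shrinks as the predictor improves.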

5.3 Estimating Derivatives 

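We now consider interventions that depend on a continuous parameter $\theta$. For instance, we 
might want to know what the performance of the ad placement engine would have been if 
we had used a parametrized scoring model. 

Let $P^\theta(\omega)$ represent the counterfactual Markov factorization associated with this 
intervention. Let $Y^\theta$ be the counterfactual expectation of $\ell(\omega)$ under distribution $P^\theta$. Computing 
the derivative of (18) immediately gives 

$$\frac{\partial Y^\theta}{\partial \theta} = \int_\omega [\ell(\omega) - \zeta(\upsilon)]\, w'_\theta(\omega)\, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} [\ell(\omega_i) - \zeta(\upsilon_i)]\, w'_\theta(\omega_i)$$
$$\text{with} \quad w_\theta(\omega) = \frac{P^\theta(\omega)}{P(\omega)} \quad \text{and} \quad w'_\theta(\omega) = \frac{\partial w_\theta(\omega)}{\partial \theta} = w_\theta(\omega)\, \frac{\partial \log P^\theta(\omega)}{\partial \theta} . \qquad (20)$$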

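Replacing the expressions $P(\omega)$ and $P^\theta(\omega)$ by the corresponding Markov factorizations 
gives many opportunities to simplify the reweighting ratio $w'_\theta(\omega)$. The term $w_\theta(\omega)$ simplifies 
as shown in (8). The derivative of $\log P^\theta(\omega)$ depends only on the factors parametrized by $\theta$. 
Therefore, in order to evaluate $w'_\theta(\omega)$, we only need to know the few factors affected by the 
intervention. 

Higher order derivatives can be estimated using the same approach. For instance, 

$$\frac{\partial^2 Y^\theta}{\partial \theta_i\, \partial \theta_j} = \int_\omega [\ell(\omega) - \zeta(\upsilon)]\, w''_{ij}(\omega)\, P(\omega) \approx \frac{1}{n} \sum_{k=1}^{n} [\ell(\omega_k) - \zeta(\upsilon_k)]\, w''_{ij}(\omega_k)$$
$$\text{with} \quad w''_{ij}(\omega) = \frac{\partial^2 w_\theta(\omega)}{\partial \theta_i\, \partial \theta_j} = w_\theta(\omega)\, \frac{\partial \log P^\theta(\omega)}{\partial \theta_i}\, \frac{\partial \log P^\theta(\omega)}{\partial \theta_j} + w_\theta(\omega)\, \frac{\partial^2 \log P^\theta(\omega)}{\partial \theta_i\, \partial \theta_j} . \qquad (21)$$

The second term in $w''_{ij}(\omega)$ vanishes when $\theta_i$ and $\theta_j$ parametrize distinct factors in $P^\theta(\omega)$. 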



5.4 Infinitesimal Interventions 

Expression (20) becomes particularly attractive when $P(\omega) = P^\theta(\omega)$, that is, when one seeks 
derivatives that describe the effect of an infinitesimal intervention on the system from which 
the data was collected. The resulting expression is then identical to the celebrated policy 
gradient (Williams, 1992) which expresses how the accumulated rewards in a reinforcement 
learning problem are affected by small changes of the parameters of the policy function. 

$$\frac{\partial Y^\theta}{\partial \theta} = \int_\omega [\ell(\omega) - \zeta(\upsilon)]\, w'_\theta(\omega)\, P^\theta(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} [\ell(\omega_i) - \zeta(\upsilon_i)]\, w'_\theta(\omega_i)$$
$$\text{where the } \omega_i \text{ are sampled i.i.d. from } P^\theta \text{ and } w'_\theta(\omega) = \frac{\partial \log P^\theta(\omega)}{\partial \theta} . \qquad (22)$$

Sampling from $P^\theta(\omega)$ eliminates the potentially large ratio $w_\theta(\omega)$ that usually plagues 
importance sampling approaches. Choosing a parametrized distribution that depends 
smoothly on $\theta$ is then sufficient to contain the size of the weights $w'_\theta(\omega)$. Since the weights 
can be positive or negative, centering the integrand with a prediction function $\zeta(\upsilon)$ remains 
very important. Even a constant predictor $\zeta$ can substantially reduce the variance 

$$\mathrm{var}\big[(\ell(\omega) - \zeta)\, w'_\theta(\omega)\big] = \mathrm{var}\big[\ell(\omega)\, w'_\theta(\omega) - \zeta\, w'_\theta(\omega)\big]$$
$$\qquad = \mathrm{var}[\ell(\omega)\, w'_\theta(\omega)] - 2\,\zeta\, \mathrm{cov}[\ell(\omega)\, w'_\theta(\omega),\, w'_\theta(\omega)] + \zeta^2\, \mathrm{var}[w'_\theta(\omega)]$$

whose minimum is reached for $\zeta = \frac{\mathrm{cov}[\ell\, w'_\theta,\, w'_\theta]}{\mathrm{var}[w'_\theta]} = \frac{E[\ell\, w'^2_\theta]}{E[w'^2_\theta]}$. 

We sometimes want to evaluate expectations under a counterfactual distribution that is 
too far from the actual distribution to obtain reasonable confidence intervals. Suppose, for 
instance, that we are unable to reliably estimate which click yield would have been observed 
if we had used a certain parameter $\theta^*$ for the scoring models. We still can estimate how 
quickly and in which direction the click yield would have changed if we had slightly moved 
the current scoring model parameters $\theta$ in the direction of the target $\theta^*$. Although such an 
answer is not as good as a reliable estimate of $Y^{\theta^*}$, it is certainly better than no answer. 
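
The estimator (22) and the variance-minimizing constant predictor can be sketched on a toy problem. The assumptions are hypothetical and chosen so that the answer is known: the only variable is $x \sim \mathcal{N}(\theta, 1)$, the reward is $\ell(x) = 5 + x$, so that $Y^\theta = 5 + \theta$ and the true derivative is 1. Fitting $\zeta$ on the same sample is a simplification that introduces a small bias.

```python
import random
import statistics

THETA = 0.0

def reward(x):
    """Toy reward; its expectation under N(theta, 1) is 5 + theta."""
    return 5.0 + x

rng = random.Random(3)  # arbitrary seed
xs = [rng.gauss(THETA, 1.0) for _ in range(100_000)]
scores = [x - THETA for x in xs]     # d/dtheta log N(x; theta, 1)
rewards = [reward(x) for x in xs]

# Constant predictor minimizing the variance: E[l w'^2] / E[w'^2].
zeta = (sum(l * s * s for l, s in zip(rewards, scores))
        / sum(s * s for s in scores))

plain_terms = [l * s for l, s in zip(rewards, scores)]
centered_terms = [(l - zeta) * s for l, s in zip(rewards, scores)]

grad_plain = statistics.fmean(plain_terms)
grad_centered = statistics.fmean(centered_terms)
var_plain = statistics.pvariance(plain_terms)
var_centered = statistics.pvariance(centered_terms)
print(zeta, grad_plain, grad_centered, var_plain, var_centered)
```

Both averages estimate the same derivative, but the centered terms have a much smaller variance, which is why centering matters when the weights take both signs.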



5.5 Off-Policy Derivatives 

Estimating derivatives using data sampled from a distribution $P(\omega)$ different from $P^\theta(\omega)$ 
is more challenging because the ratios $w_\theta(\omega_i)$ in equation (20) can take very large values. 
However it is comparatively easy to estimate the derivatives of lower and upper bounds 
computed using max-capped weights. 

Let $\bar{w}^z_\theta$ and $\bar{w}^m_\theta$ be respectively the zero-capped and max-capped weights, 

$$\bar{w}^z_\theta(\omega) = w_\theta(\omega)\, \mathbb{1}\{w_\theta(\omega) < R\} \quad \text{and} \quad \bar{w}^m_\theta(\omega) = \min\{w_\theta(\omega),\, R\} .$$

We assume that the parametrized probability distribution $P^\theta(\omega)$ is regular enough to ensure 
that all the derivatives of interest are defined and that the event $\{w_\theta(\omega) = R\}$ has probability 
zero. Furthermore, in order to simplify the exposition, the following derivation does not 
leverage an invariant predictor function. 

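Proceeding as in section 4.4, we define the quantities 

$$\bar{Y}^\theta = \int_\omega \ell(\omega)\, \bar{w}^m_\theta(\omega)\, P(\omega) \quad \text{and} \quad \bar{W}^\theta = \int_\omega \bar{w}^m_\theta(\omega)\, P(\omega) \qquad (23)$$

and obtain the inequality 

$$\bar{Y}^\theta \le Y^\theta \le \bar{Y}^\theta + M\, (1 - \bar{W}^\theta) . \qquad (24)$$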
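
The bounds (24) can be sketched numerically under stated assumptions: the data is simulated with the log-normal multipliers of section 4.6, the counterfactual distribution uses a hypothetical $\rho^* = 1.5$, the capping bound is set to $R = 5$, and the toy reward takes values in $[0, M]$ with $M = 1$. The reference value is obtained by sampling the counterfactual distribution directly, which is of course only possible in a simulation.

```python
import math
import random

SIGMA, RHO, RHO_STAR = 0.3, 1.0, 1.5
R = 5.0   # capping bound (hypothetical choice)
M = 1.0   # upper bound on the reward

def weight(eps, rho_star=RHO_STAR):
    """Density ratio w(omega) for a multiplier drawn with noise eps under RHO."""
    a = math.log(rho_star / RHO) / SIGMA
    return math.exp(a * eps - 0.5 * a * a)

def multiplier(eps, rho):
    return rho * math.exp(-0.5 * SIGMA ** 2 + SIGMA * eps)

def reward(m):
    """Toy reward with values in [0, M]."""
    return min(M, m / 2.0)

rng = random.Random(4)  # arbitrary seed
noise = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

zero_capped = [w if w < R else 0.0 for w in map(weight, noise)]
max_capped = [min(w, R) for w in map(weight, noise)]

y_bar = sum(reward(multiplier(e, RHO)) * w for e, w in zip(noise, max_capped)) / len(noise)
w_bar = sum(max_capped) / len(noise)
lower, upper = y_bar, y_bar + M * (1.0 - w_bar)

# Reference value sampled directly from the counterfactual distribution.
truth = sum(reward(multiplier(rng.gauss(0.0, 1.0), RHO_STAR))
            for _ in range(100_000)) / 100_000
print(lower, truth, upper)
```

The gap between the two bounds is $M(1 - \bar{W}^\theta)$, which measures how much counterfactual probability mass falls where the weights exceed the capping bound.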

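In order to obtain reliable estimates of the derivatives of these upper and lower bounds, 
it is of course sufficient to obtain reliable estimates of the derivatives of $\bar{Y}^\theta$ and $\bar{W}^\theta$. By 
separately considering the cases $w_\theta(\omega) < R$ and $w_\theta(\omega) > R$, we easily obtain the relation 

$$\bar{w}^{m\prime}_\theta(\omega) = \frac{\partial \bar{w}^m_\theta(\omega)}{\partial \theta} = \bar{w}^z_\theta(\omega)\, \frac{\partial \log P^\theta(\omega)}{\partial \theta} \quad \text{when } w_\theta(\omega) \ne R$$

and, thanks to the regularity assumptions, we can write 

$$\frac{\partial \bar{Y}^\theta}{\partial \theta} = \int_\omega \ell(\omega)\, \bar{w}^{m\prime}_\theta(\omega)\, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} \ell(\omega_i)\, \bar{w}^{m\prime}_\theta(\omega_i) ,$$
$$\frac{\partial \bar{W}^\theta}{\partial \theta} = \int_\omega \bar{w}^{m\prime}_\theta(\omega)\, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} \bar{w}^{m\prime}_\theta(\omega_i) .$$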

Estimating these derivatives is considerably easier than using approximation (20) because 
they involve the bounded quantity $\bar{w}^z_\theta(\omega)$ instead of the potentially large ratio $w_\theta(\omega)$. It 
is still necessary to choose a sufficiently smooth sampling distribution $P(\omega)$ to limit the 
magnitude of $\partial \log P^\theta / \partial \theta$. 

Figure 17: Single design - A preferred parameter value $\theta^*$ is determined using randomized 
data collected in the past. Test data is collected after loading $\theta^*$ into the system. 



Such derivatives are very useful to drive optimization algorithms. Assume for instance 
that we want to find the parameter $\theta$ that maximizes the counterfactual expectation $Y^\theta$. 
Maximizing the estimate obtained using approximation (7) is unwise because it could reach 
its maximum for a value of $\theta$ that is poorly explored by the actual distribution. As explained 
in section 4.5, the gap between the upper and lower bound reveals the uncertainty associated 
with insufficient exploration. Therefore, maximizing an estimate of the lower bound (24) 
ensures that the optimization algorithm finds a trustworthy answer. 

6. Learning 

Optimizing a counterfactual estimate is fundamentally a learning procedure. The main 
purpose of this section is to demonstrate that optimizing a counterfactual expectation 
estimate is a sound learning principle and to outline its relation to well known learning methods 
for bandit and reinforcement learning problems. 

6.1 A Learning Principle 

We consider the simple learning setup illustrated in figure 17. Training data is collected 
during a single data collection experiment designed using prior knowledge acquired in an 
unspecified manner. A preferred parameter value $\theta^*$ is then determined using the training 
data and loaded into the system. The goal is of course to observe a good performance on 
data collected during a test period that takes place after the switching point. 

The isolation assumption introduced in section 3.2 states that the exogenous variables 
are drawn from an unknown but fixed joint probability distribution. This distribution 
induces a joint distribution $P^\theta(\omega)$ on all the variables $\omega$ appearing in the structural equation 
model associated with the parameter $\theta$. Therefore, if the isolation assumption remains 
valid during the test period, the test data follows the same distribution $P^{\theta^*}(\omega)$ that would 
have been observed during the training data collection period if the system had been using 
parameter $\theta^*$ all along. 

Therefore we can state this problem as the optimization of the expectation $Y^\theta$ of the 
reward $\ell(\omega)$ with respect to the distribution $P^\theta(\omega)$, 

$$\max_\theta\; Y^\theta = \int_\omega \ell(\omega)\, P^\theta(\omega) , \qquad (25)$$

on the basis of a finite set of training examples $\omega_1, \ldots, \omega_n$ sampled from distribution $P(\omega)$. 
Following section 5.5, we propose to leverage inequality (24) and, as a learning principle, 
to optimize an empirical estimate $\hat{Y}^\theta$ of the lower bound $\bar{Y}^\theta$: 

$$\theta^* = \arg\max_\theta\; \hat{Y}^\theta . \qquad (26)$$

We shall now discuss the statistical basis of this learning principle (see footnote 7). 
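
The learning principle (26) can be sketched on a toy problem in which the parameter is the mean $\rho^*$ of the multiplier distribution and is chosen from a finite grid. The reward function, the capping bound $R = 10$, and the grid are hypothetical assumptions. The empirical lower bound averages the reward multiplied by the max-capped weight, and the selected parameter is the grid value that maximizes it.

```python
import math
import random

SIGMA, RHO = 0.3, 1.0
R = 10.0  # capping bound (hypothetical choice)

def reward(m):
    """Toy reward in [0, 1], largest when the multiplier is near 1.3."""
    return max(0.0, 1.0 - (m - 1.3) ** 2)

rng = random.Random(5)  # arbitrary seed
noise = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
rewards = [reward(RHO * math.exp(-0.5 * SIGMA ** 2 + SIGMA * e)) for e in noise]

def lower_bound(rho_star):
    """Empirical lower bound: average of reward times max-capped weight."""
    a = math.log(rho_star / RHO) / SIGMA
    total = 0.0
    for e, l in zip(noise, rewards):
        total += l * min(math.exp(a * e - 0.5 * a * a), R)
    return total / len(noise)

grid = [round(0.6 + 0.2 * k, 1) for k in range(13)]   # candidates 0.6, 0.8, ..., 3.0
bounds = {rho_star: lower_bound(rho_star) for rho_star in grid}
theta_star = max(bounds, key=bounds.get)
print(theta_star, bounds[theta_star])
```

Candidates far from the data collection distribution lose most of their weight mass to the capping, so their lower bound is small and they are not selected, whatever their true performance might be.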

6.2 Uniform Confidence Intervals 

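As discussed in section 4.4, inequality (24), 

$$\bar{Y}^\theta \le Y^\theta \le \bar{Y}^\theta + M\,(1 - \bar{W}^\theta) ,$$

where 

$$\bar{Y}^\theta = \int_\omega \ell(\omega)\, \bar{w}^m_\theta(\omega)\, P(\omega) \approx \hat{Y}^\theta = \frac{1}{n} \sum_{i=1}^{n} \ell(\omega_i)\, \bar{w}^m_\theta(\omega_i) ,$$
$$\bar{W}^\theta = \int_\omega \bar{w}^m_\theta(\omega)\, P(\omega) \approx \hat{W}^\theta = \frac{1}{n} \sum_{i=1}^{n} \bar{w}^m_\theta(\omega_i) ,$$

leads to confidence intervals (16) of the form 

$$\forall \delta > 0,\ \forall \theta \quad P\big\{\, \hat{Y}^\theta - \epsilon_R \le Y^\theta \le \hat{Y}^\theta + M\,(1 - \hat{W}^\theta + \xi_R) + \epsilon_R \,\big\} \ge 1 - \delta . \qquad (27)$$

Both $\epsilon_R$ and $\xi_R$ converge to zero in inverse proportion to the square root of the sample size $n$. 
They also increase at most linearly in $\log \delta$ and depend on both the capping bound $R$ and 
the parameter $\theta$ through the empirical variances (see appendix A.2). 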

Such confidence intervals are insufficient to provide guarantees for a parameter value $\theta^*$ 
that depends on the sample. In fact, the optimization procedure (26) is likely to select 
values of $\theta$ for which the inequality is violated. We therefore seek uniform confidence 
intervals (Vapnik and Chervonenkis, 1968), simultaneously valid for all values of $\theta$. 

• When the parameter is chosen from a finite set $\mathcal{F}$, applying the union bound to the 
ordinary intervals (27) immediately gives the uniform confidence interval: 

$$P\big\{\, \forall \theta \in \mathcal{F},\ \ \hat{Y}^\theta - \epsilon_R \le Y^\theta \le \hat{Y}^\theta + M\,(1 - \hat{W}^\theta + \xi_R) + \epsilon_R \,\big\} \ge 1 - |\mathcal{F}|\, \delta .$$



• Following the pioneering work of Vapnik and Chervonenkis, a broad range of 
mathematical tools has been developed to construct uniform confidence intervals when the 
set $\mathcal{F}$ is infinite. For instance, appendix A.4 leverages uniform empirical Bernstein 
bounds (Maurer and Pontil, 2009) and obtains the uniform confidence interval 

$$P\big\{\, \forall \theta \in \mathcal{F},\ \ \hat{Y}^\theta - \epsilon_R \le Y^\theta \le \hat{Y}^\theta + M\,(1 - \hat{W}^\theta + \xi_R) + \epsilon_R \,\big\} \ge 1 - \mathcal{M}(n)\, \delta , \qquad (28)$$

where the growth function $\mathcal{M}(n)$ measures the capacity of the family of functions 

$$\big\{\, f_\theta : \omega \mapsto \ell(\omega)\, \bar{w}^m_\theta(\omega) ,\ \ g_\theta : \omega \mapsto \bar{w}^m_\theta(\omega) ,\ \ \forall \theta \in \mathcal{F} \,\big\} . \qquad (29)$$

Many practical choices of $P^\theta(\omega)$ lead to functions $\mathcal{M}(n)$ that grow polynomially with 
the sample size. Because both $\epsilon_R$ and $\xi_R$ are $O(n^{-1/2} \log \delta)$, they converge to zero 
with the sample size when one maintains the confidence level $1 - \mathcal{M}(n)\, \delta$ equal to a 
predefined constant. 

7. The idea of maximizing the lower bound may surprise readers familiar with the UCB algorithm for 
multi-armed bandits (Auer et al., 2002). UCB performs exploration by maximizing the upper confidence 
interval bound and updating the confidence intervals online. Exploration in this setup results from the 
active system randomization during the offline data collection. See also section 6.4. 

The interpretation of the inner and outer confidence intervals (section 4.5) also applies to 
the uniform confidence interval (28). When the sample size is sufficiently large and the 
capping bound $R$ chosen appropriately, the inner confidence interval reflects the upper and 
lower bound of inequality (24). 

Therefore, $Y^{\theta^*}$ is close to the maximum of the lower bound of inequality (24) which 
essentially represents the best performance that can be guaranteed using training data 
sampled from $P(\omega)$. Meanwhile, the upper bound reveals which values of $\theta$ could potentially 
offer better performance but have been insufficiently probed by the sampling distribution. 
Both bounds are estimated by the bounds of the inner confidence interval (figure 18). 

6.3 Tuning Ad Placement Auctions 

We now present an application of this learning principle to the optimization of auction 
tuning parameters in the ad placement engine. Despite increasingly challenging engineering 
difficulties, comparable optimization procedures can obviously be applied to larger numbers 
of tunable parameters. 

Lahaie and McAfee (2011) propose to account for the uncertainty of the click probability 
estimation by introducing a squashing exponent $\alpha$ to control the impact of the estimated 
probabilities on the rank scores. Using the notations introduced in section 2.1, and assuming 
that the estimated probability of a click on ad $i$ placed at position $p$ after query $x$ has the 
form $q_{ip}(x) = \gamma_p\, \beta_i(x)$ (see appendix A.1), they redefine the rank-score as 

$$r_{ip}(x) = \gamma_p\, b_i\, \beta_i(x)^\alpha .$$

Figure 19: Level curves associated with the average number of mainline ads per page (red 
curves, from -6% to +10%) and the average estimated advertisement value 
generated per page (black curves, arbitrary units ranging from 164 to 169) that 
would have been observed for a certain query cluster if we had changed the 
mainline reserves by the multiplicative factor shown on the horizontal axis, and 
if we had applied a squashing exponent $\alpha$ shown on the vertical axis to the 
estimated click probabilities $q_{i,p}(x)$. 

Using a squashing exponent $\alpha < 1$ reduces the contribution of the estimated probabilities 
and increases the reliance on the bids $b_i$ placed by the advertisers. 
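
A small sketch illustrates how the squashing exponent shifts the ranking. The two ads, their bids, and their click propensities are hypothetical values chosen for the example, and the position factor is set to 1.

```python
def rank_score(gamma_p, bid, beta, alpha):
    """Squashed rank-score r = gamma_p * b * beta**alpha."""
    return gamma_p * bid * beta ** alpha

# Hypothetical ads: (name, bid, estimated click propensity beta).
ads = [("high_bid", 4.0, 0.02), ("high_ctr", 1.0, 0.10)]

def ranking(alpha, gamma_p=1.0):
    """Ad names sorted by decreasing rank-score for a given exponent."""
    ordered = sorted(ads, key=lambda ad: rank_score(gamma_p, ad[1], ad[2], alpha),
                     reverse=True)
    return [name for name, _, _ in ordered]

order_unsquashed = ranking(alpha=1.0)   # scores 0.08 versus 0.10
order_squashed = ranking(alpha=0.5)     # scores about 0.57 versus 0.32
print(order_unsquashed, order_squashed)
```

With $\alpha = 1$ the ad with the higher estimated click probability ranks first, whereas with $\alpha = 0.5$ the ad with the higher bid takes the lead.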

Because the squashing exponent changes the rank-score scale, it is necessary to 
simultaneously adjust the reserves in order to display a comparable number of ads. In order to 
estimate the counterfactual performance of the system under interventions affecting both 
the squashing exponent and the mainline reserves, we have collected data using a 
random squashing exponent following a normal distribution, and a mainline reserve multiplier 
following a log-normal distribution as described in section 4.6. Samples describing 12 million 
search result pages were collected during four consecutive weeks. 

Following Charles and Chickering (2012), we consider separate squashing coefficients $\alpha_k$ 
and mainline reserve multipliers $\rho_k$ per query cluster $k \in \{1..K\}$, and, in order to avoid 
negative user or advertiser reactions, we seek the auction tuning parameters $\alpha_k$ and $\rho_k$ that 
maximize an estimate of the advertisement value (see footnote 8) subject to a global constraint on the 
average number of ads displayed in the mainline. Because maximizing the advertisement 
value instead of the publisher revenue amounts to maximizing the size of the advertisement 
pie instead of the publisher slice of the pie, this criterion is less likely to simply raise the 
prices without improving the ads. Meanwhile the constraint ensures that users are not 
exposed to excessive numbers of mainline ads. 

8. The value of an ad click from the point of view of the advertiser. The advertiser payment then splits the 
advertisement value between the publisher and the advertiser. 

Figure 20: Sequential design - The parameter $\theta_t$ of each data collection experiment is 
determined using data collected during the previous experiments. 

We then use the collected data to estimate the off-policy derivatives of the lower bound of 
the counterfactual expectations of the advertiser value and of the counterfactual expectation 
of the number of mainline ads per page. Figure 19 shows the corresponding level curves 
for a particular query cluster. We did not use an upper bound for the number of mainline 
ads because this estimate is always much more accurate than the estimate of the value (see 
also figure 13). The ability to estimate these derivatives is then sufficient to run a simple 
optimization algorithm and determine the optimal auction tuning parameters. 

The obvious alternative (see Charles and Chickering, 2012) consists in replaying the 
auctions with different parameters and simulating the user with a click probability model. 
However, it may be unwise to rely on a click probability model to estimate the best value 
of a squashing coefficient that is expected to compensate for the uncertainty of the click 
prediction model itself. The counterfactual approach described here avoids the problem 
because it does not rely on a click prediction model to simulate users. Instead it estimates 
the counterfactual performance of the system using the actual behavior of the users collected 
under moderate randomization. 

6.4 Sequential Design 

Confidence intervals computed after a first randomized data collection experiment might 
not offer sufficient accuracy to choose a definitive value of the parameter $\theta$. It is generally 
unwise to simply collect additional samples using the same experimental setup because the 
current data already reveals information (figure 18) that can be used to design a better 
data collection experiment. Therefore, it seems natural to extend the learning principle 
discussed in section 6.1 to a sequence of data collection experiments. The parameter $\theta_t$ 
characterizing the $t$-th experiment is then determined using samples collected during the 
previous experiments (figure 20). 

Although it is relatively easy to construct convergent algorithms for the design of 
sequential experiments, achieving the best learning performance is notoriously difficult (e.g., 
Wald, 1945) because the selection of parameter $\theta_t$ involves a trade-off between 
exploitation, that is, the maximization of the immediate reward $Y^{\theta_t}$, and exploration, that is, the 
collection of samples potentially leading to better $Y^\theta$ in the more distant future. 

The exploration-exploitation trade-off is well understood in the case of multi-armed 
bandits (Gittins, 1989; Auer et al., 2002; Audibert et al., 2007; Seldin et al., 2012) because 
the analysis can leverage an essential property of multi-armed bandits: the outcome 
observed after performing a particular action brings no information about the value of other 
actions. In practice, such an assumption is both unrealistic and pessimistic. For instance, 
the outcome observed after displaying a certain ad in response to a certain query brings 
very useful information about the value of displaying similar ads on similar queries. 

Although little theoretical guidance is available in the general case, experience suggests
that simple exploration heuristics perform surprisingly well. In fact, even in the simple
case of multi-armed bandits, excellent empirical results have been obtained using Thompson
sampling (Chapelle and Li, 2011) or using fixed exploration strategies (Vermorel and
Mohri, 2005; Kuleshov and Precup, 2010). Therefore, it is often practical to simply set
up each experiment by maximizing $Y^\theta$ as described in section 6.1, subject to additional
ad-hoc constraints ensuring that each successive experiment guarantees a minimum level of
exploration.
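
The following Python sketch illustrates one such fixed exploration heuristic on a toy problem. It is not the procedure used for the experiments reported in this work: the candidate grid, the `min_explore` fraction, and the `reward_fn` callback are hypothetical names introduced for illustration. Each experiment mostly runs the candidate with the best empirical average so far, while a fixed fraction of the samples is reserved for uniformly chosen candidates.

```python
import random


def run_sequential_experiments(reward_fn, candidates, rounds, samples_per_round,
                               min_explore=0.1, seed=0):
    """Toy sequential design: exploit the best candidate, keep an exploration floor.

    reward_fn(theta, rng) returns one observed reward for parameter value theta.
    Returns the parameter selected for each experiment and the number of
    samples collected for each candidate.
    """
    rng = random.Random(seed)
    totals = {c: 0.0 for c in candidates}
    counts = {c: 0 for c in candidates}

    def empirical_mean(c):
        # Untried candidates look infinitely promising, so each one gets tried.
        return totals[c] / counts[c] if counts[c] else float("inf")

    selected = []
    for _ in range(rounds):
        theta_t = max(candidates, key=empirical_mean)  # exploitation target
        selected.append(theta_t)
        for _ in range(samples_per_round):
            # Ad-hoc constraint: a fixed fraction of the samples explores uniformly.
            explore = rng.random() < min_explore
            theta = rng.choice(candidates) if explore else theta_t
            totals[theta] += reward_fn(theta, rng)
            counts[theta] += 1
    return selected, counts


# Usage on a hypothetical noiseless reward curve peaking at theta = 0.6.
grid = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
selected, counts = run_sequential_experiments(
    lambda theta, rng: -(theta - 0.6) ** 2, grid, rounds=10, samples_per_round=50)
```

The exploration floor is deliberately crude. It only guarantees that every candidate keeps receiving samples, which is the property the ad-hoc constraints are meant to enforce.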

7. Equilibrium Analysis 

All the methods discussed in this contribution rely on the isolation assumption presented 
in section 3.2. This assumption lets us interpret the samples as repeated independent 
trials that follow the pattern defined by the structural equation model and are amenable to 
statistical analysis. 

The isolation assumption is in fact a component of the counterfactual conditions under 
investigation. For instance, in section 4.6, we model single auctions (figure 3) in order to 
empirically determine how the ad placement system would have performed if we had changed 
the mainline reserves without incurring a reaction from the users or the advertisers. 

Since the future publisher revenues depend on the continued satisfaction of users and 
advertisers, lifting this restriction is highly desirable. 

• We can in principle work with larger structural equation models. For instance, figure 4
suggests threading single auction models with additional causal links representing the
impact of the displayed ads on the future user goodwill. However, there are practical
limits on the number of trials we can consider at once. For instance, it is relatively
easy to simultaneously model all the auctions associated with the web pages served
to the same user during a thirty minute web session. On the other hand, it would
be challenging to consider several weeks' worth of auctions in order to model their
accumulated effect on the continued satisfaction of users and advertisers.

• We can sometimes use problem-specific knowledge to construct alternate performance 
metrics that anticipate the future effects of the feedback loops. For instance, in 
section 6.3, we optimize the advertisement value instead of the publisher revenue. 
Since this alternative criterion takes the advertiser interests into account, it can be 
viewed as a heuristic proxy for the future revenues of the publisher. 

This section proposes a third way to take feedback loops into account. Using data collected
while the system was at equilibrium, we describe empirical methods to determine how an
infinitesimal intervention would have displaced the equilibrium:

    "How would the system have performed during the data collection period if a small
    change $d\theta$ had been applied to the model parameter $\theta$ and the equilibrium had
    been reached before the data collection period?"

Figure 21: The new variable $g$ is a measure of the relevance of the displayed ads. We model
the user feedback loop by letting the clicks $y$ depend on new parameters $g_k$
representing the average relevance experienced by each user in the past.

We first outline the main idea using the example of the user reactions to interventions 
on the ad placement system. We then describe a more sophisticated framework modeling 
the reactions of rational advertisers to such interventions. 

7.1 Modeling User Reactions 

Since displaying irrelevant advertisement messages has known negative effects, various ways 
to measure the ad relevance have been designed. For instance, human labelers can be asked 
to score the ads displayed in response to a particular query. The new variable g in figure 21 
represents the measured relevance of the ad slate s displayed in response to query x. 

Besides the parameter $\theta$ controlling the conditional probability distribution $P^\theta(s \mid x, a)$
that represents the scoring models, we introduce new parameters $g_1 \dots g_K$ that represent
the average relevances anticipated by each user on the basis of their past experience, and
we assume that each of these parameters affects the conditional click probability $P^{g_k}(y \mid s, u)$
of the corresponding user. This model relies on the assumption that the relevance measure
is good enough to capture how the past experience of the users affects their ad clicks (see
footnote 9).

Let $p_k$ denote the probability $P\{\mathrm{user} = k\}$ that the web page is served to user $k$. The
following counterfactual expectation functions then express the expected performance and
relevance as a function of the user anticipated relevances.

$$ Y_k(\theta, g_k) = \int_\omega \ell(\omega)\, P^{\theta, g_k}(\omega \mid \mathrm{user} = k) , $$

$$ Y(\theta, g_1 \dots g_K) = \int_\omega \ell(\omega)\, P^{\theta, g_1 \dots g_K}(\omega) = \sum_k p_k\, Y_k(\theta, g_k) , $$

$$ G_k(\theta, g_k) = \int_\omega g(\omega)\, P^{\theta, g_k}(\omega \mid \mathrm{user} = k) , $$

$$ G(\theta, g_1 \dots g_K) = \int_\omega g(\omega)\, P^{\theta, g_1 \dots g_K}(\omega) = \sum_k p_k\, G_k(\theta, g_k) . $$

9. Although a poor experience can also drive users to competing web sites, modeling how the publisher loses
users (e.g. because of poor relevance) or acquires new users (e.g. with marketing initiatives) is beyond
the scope of this section.

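Immediately after an intervention on the ad placement engine, the user anticipations $g_k$
are incorrect because they are based on experiences that no longer represent the system
performance. After a certain time, the user relevance anticipations $g_k$ match the actual
expectations $G_k(\theta, g_k)$ and the system returns to equilibrium.

Our analysis relies on a quasi-static assumption: we assume that the publisher changes
the parameter $\theta$ so slowly that the system remains at equilibrium at all times. Therefore,
in response to an infinitesimal change $d\theta$ of the scoring model parameter (see footnote 10),

$$ \forall k \quad dg_k = dG_k = \frac{\partial G_k}{\partial \theta}\, d\theta + \frac{\partial G_k}{\partial g_k}\, dg_k = \frac{\partial G_k}{\partial \theta}\, d\theta . \tag{30} $$

We can then express the variations of the counterfactual expectation $Y(\theta, g_1 \dots g_K)$:

$$ dY = \frac{\partial Y}{\partial \theta}\, d\theta + \sum_k \frac{\partial Y}{\partial g_k}\, dg_k = \left( \frac{\partial Y}{\partial \theta} + \sum_k p_k\, \frac{\partial Y_k}{\partial g_k}\, \frac{\partial G_k}{\partial \theta} \right) d\theta . $$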

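Each partial derivative $\partial Y_k / \partial g_k$ describes how the average click probability of a single user
changes with his or her anticipated ad relevance. In order to permit the estimation of these
derivatives, we make the additional assumption that all users respond in the same manner:

$$ \frac{\partial Y_1}{\partial g_1} = \dots = \frac{\partial Y_K}{\partial g_K} \triangleq \frac{\partial Y}{\partial \bar{g}} . \tag{31} $$

Since $\sum_k p_k\, \partial G_k / \partial \theta = \partial G / \partial \theta$, we then obtain our final answer

$$ dY = \left( \frac{\partial Y}{\partial \theta} + \frac{\partial Y}{\partial \bar{g}}\, \frac{\partial G}{\partial \theta} \right) d\theta . \tag{32} $$

This expression describes how the expectation $Y$ changes when the publisher applies an
infinitesimal change $d\theta$ to the scoring parameter $\theta$ and the users adjust their relevance
anticipations $g_k$ in response. Therefore, if we can empirically estimate the three partial
derivatives appearing in equation (32), we can estimate how infinitesimal changes of the
scoring model parameter $\theta$ affect the performance of the ad placement engine measured
after incurring the user reaction.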
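
As a sanity check of the quasi-static reasoning, the following Python sketch compares the formula $dY/d\theta = \partial Y/\partial\theta + (\partial Y/\partial\bar{g})(\partial G/\partial\theta)$ with a brute-force derivative of the performance measured at equilibrium. The functions `relevance` and `clicks` are purely hypothetical smooth stand-ins for $G$ and $Y$, and finite differences replace the empirical estimators; none of this comes from the ad placement data.

```python
import math


def relevance(theta):
    """Hypothetical average relevance G(theta); independent of the anticipations."""
    return 1.0 / (1.0 + math.exp(-theta))


def clicks(theta, g_bar):
    """Hypothetical click yield Y(theta, g_bar) for anticipated relevance g_bar."""
    return 0.05 * g_bar * (1.0 + 0.3 * math.tanh(theta))


def partial(f, args, index, h=1e-6):
    """Central finite difference of f with respect to argument number `index`."""
    hi, lo = list(args), list(args)
    hi[index] += h
    lo[index] -= h
    return (f(*hi) - f(*lo)) / (2.0 * h)


def equilibrium_derivative(theta):
    """dY/dtheta when the users' anticipations track the new relevance."""
    g_bar = relevance(theta)  # equilibrium condition: g_bar = G(theta)
    dY_dtheta = partial(clicks, (theta, g_bar), 0)
    dY_dg = partial(clicks, (theta, g_bar), 1)
    dG_dtheta = partial(relevance, (theta,), 0)
    return dY_dtheta + dY_dg * dG_dtheta


def brute_force_derivative(theta, h=1e-6):
    """Differentiate the performance measured after re-equilibration."""
    at_equilibrium = lambda t: clicks(t, relevance(t))
    return (at_equilibrium(theta + h) - at_equilibrium(theta - h)) / (2.0 * h)
```

In this toy model the term involving $\partial Y/\partial\bar{g}$ is positive, so ignoring the user reaction would underestimate the effect of a relevance-improving change.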



10. The specific structure of the user feedback loop (see figure 21) ensures that the relevance $g$ does not
depend on the relevance anticipations $g_k$. The resulting simplification $\partial G_k / \partial g_k = 0$ facilitates the
derivation but is not an essential requirement of the proposed method.

Figure 22: Advertisers select the bid amounts $b_a$ on the basis of the past number of clicks $y_a$
and the past prices $z_a$ observed for the corresponding ads.



• Estimating the partial derivatives $\partial Y / \partial \theta$ and $\partial G / \partial \theta$ is a straightforward application of the
policy gradient method discussed in section 5.4.

• Estimating $\partial Y / \partial \bar{g}$ is less direct because we cannot algorithmically select $\bar{g}$ randomly
before each auction. However, we can organize a data collection experiment in two
successive phases. During the priming phase, randomly selected users are exposed
to ads selected by a slightly degraded version of the ad placement engine. During
the second phase, all users are again treated identically. However, the ad relevance
anticipations of the users selected during the priming phase still reflect the lower
relevance experienced during the priming phase. Their lower click probabilities then
reveal the partial derivative of interest. This experiment also reveals how long users
take to recover after being exposed to less relevant ads.
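
The two-phase experiment suggests a simple finite-difference estimate, sketched below in Python under strong simplifying assumptions: the relevance degradation is small enough for a linear approximation, and clicks are compared over the same second-phase period. The function name and its arguments are hypothetical; a real analysis would also need confidence intervals.

```python
from statistics import mean


def estimate_dY_dg(control_clicks, primed_clicks, control_relevance, primed_relevance):
    """Finite-difference estimate of the click sensitivity to anticipated relevance.

    control_clicks / primed_clicks: per-auction click indicators (0 or 1) observed
    during the second phase for users outside / inside the priming group.
    control_relevance / primed_relevance: average relevance each group
    experienced during the priming phase.
    """
    delta_g = primed_relevance - control_relevance
    if delta_g == 0:
        raise ValueError("the priming phase did not change the experienced relevance")
    delta_y = mean(primed_clicks) - mean(control_clicks)
    return delta_y / delta_g


# Usage with made-up numbers: primed users saw less relevant ads and click less.
slope = estimate_dY_dg(
    control_clicks=[1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    primed_clicks=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    control_relevance=0.8,
    primed_relevance=0.6,
)
```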

The method described in this section can of course be refined in many ways. For 
instance, we could consider multiple relevance signals, multiple averaging methods, and 
multiple clusters of users assumed to respond identically to relevance changes. The principle 
remains the same. 

7.2 Modeling Rational Advertisers 

The ad placement system is an example of a game where each actor furthers his or her
interests by controlling some aspects of the system: the publisher controls the placement
engine parameters, the advertisers control the bids, and the users control the clicks.

The previous section relies on assumption (31) and on ad-hoc experiments to determine 
how users react to changes in anticipated relevance. This section assumes that the adver- 
tisers are rational and therefore always maximize their economic interests. Although there 
are more realistic ways to model advertisers, this exercise is interesting because the same 
assumption underlies auction theory (see section 2.1). This approach therefore provides a 
framework to seamlessly integrate auction theory and machine learning. 

As illustrated in figure 22, we treat the bid vector $b_\star = (b_1 \dots b_A) \in [0, b_{\max}]^A$ as the
parameter of the conditional distribution $P^{b_\star}(b \mid x, v)$ of the bids associated with the eligible
ads. Each variable $y_a$ in the structural equation model represents the number of clicks
received by ads associated with bid $b_a$. Each variable $z_a$ represents the amount charged for
these clicks to the corresponding advertiser.

Rational advertisers seek to maximize their surplus, that is, the difference between the
value they see in the clicks and the price they pay to the publisher (figure 23). Therefore,
each advertiser selects bids $b_a$ according to their anticipated impact on the number of
resulting clicks $y_a$ and on their cost $z_a$.

Let $V_a$ denote the value of a click for the advertiser. Following the pattern of the perfect
information assumption (see section 2.1), we assume that the advertisers eventually acquire
full knowledge of the expectations

$$ Y_a(\theta, b_\star) = \int_\omega y_a\, P^{\theta, b_\star}(\omega) \quad \text{and} \quad Z_a(\theta, b_\star) = \int_\omega z_a\, P^{\theta, b_\star}(\omega) , $$

and reach a Nash equilibrium

$$ \forall a \quad b_a \in \mathop{\mathrm{ArgMax}}_{b}\; U_a^\theta(b_1, \dots, b_{a-1}, b, b_{a+1}, \dots, b_A) \tag{33} $$

with utility functions

$$ U_a^\theta(b_\star) = V_a\, Y_a(\theta, b_\star) - Z_a(\theta, b_\star) . $$

The existence of such a Nash equilibrium is not obvious (see footnote 11). However, we do not strictly
need to establish that this equilibrium exists for any combination of advertiser values $V_a$.
Since the true values are unknown, we shall use data collected when the system is stationary
to estimate advertiser values $V_a$ that are consistent with a Nash equilibrium. We shall then
estimate how a small change of the model parameters $\theta$ displaces this posited equilibrium.

The injection of smooth random noise into the auction mechanism changes the discrete
problem into a continuous problem amenable to well known differential methods. Therefore
we assume that the densities $P^{b_\star}(b \mid x, v)$ and $P^\theta(q \mid x, a)$ are smooth enough to ensure that
the expectations $Y_a$ and $Z_a$ are continuously differentiable functions of the parameters $b_\star$
and $\theta$. The equilibrium (33) then satisfies the necessary Kuhn-Tucker conditions

$$ \forall a \quad V_a\, \frac{\partial Y_a}{\partial b_a} - \frac{\partial Z_a}{\partial b_a} \;
\begin{cases}
\le 0 & \text{if } b_a = 0, \\
\ge 0 & \text{if } b_a = b_{\max}, \\
= 0 & \text{if } 0 < b_a < b_{\max}.
\end{cases} \tag{34} $$

If the corresponding ad is displayed with sufficient frequency, we can estimate the value
of the partial derivatives appearing in (34) by randomizing the bids and computing the policy
gradient as explained in section 5.4.

However, the publisher is not allowed to directly randomize the bids because the
advertisers expect to pay prices computed using the bid they have specified and not the
potentially higher bids resulting from the randomization. Fortunately, the publisher has
full control on the estimated click probabilities $q_{i,p}(x)$. Since the rank-scores $r_{i,p}(x)$ are
the products of the bids and the estimated click probabilities (see section 2.1), a random
multiplier applied to the bids can also be interpreted as a random multiplier applied to the
estimated click probabilities. Under these two interpretations, the same ads are shown to
the users, but different click prices are charged to the advertisers. Therefore, the publisher
can simultaneously collect data as if the multiplier was applied to the bid, and charge prices
computed as if the multiplier was applied to the estimated click probabilities.

11. Subject to certain assumptions on the utility functions, classic results ensure the existence of a Nash
equilibrium for arbitrary values $V_a$. For instance, if we assume that the marginal click prices increase with
the click volume, the pricing curves are convex (figure 23), the utilities $U_a^\theta(b_\star)$ are diagonally quasiconcave,
and Friedman's theorem (1977) establishes the existence of a Nash equilibrium.

Figure 23: Advertisers control the expected number of clicks $Y_a$ and expected prices $Z_a$
by adjusting their bids $b_a$. Rational advertisers select bids that maximize the
difference between the value they see in the clicks and the price they pay.

We can then estimate the advertiser values $V_a$ by solving the equilibrium equations.
There are however a few caveats:

• The advertiser bid $b_a$ may be too small to cause ads to be displayed. In the absence
of data, we have no means to estimate the value of a click for these advertisers.

• Many ads are not displayed often enough to obtain accurate estimates of the partial
derivatives $\partial Y_a / \partial b_a$ and $\partial Z_a / \partial b_a$. This can be partially remediated by smartly aggregating the
data of advertisers deemed similar.

• Some advertisers attempt to capture all the available ad opportunities by placing
extremely high bids and hoping to pay reasonable prices thanks to the generalized
second price rule. Both partial derivatives $\partial Y_a / \partial b_a$ and $\partial Z_a / \partial b_a$ are equal to zero in such
cases. Therefore we cannot recover $V_a$ by solving the equilibrium equation (34). It is
however possible to collect useful data by selecting for these advertisers a maximum
bid $b_{\max}$ that prevents them from monopolizing the eligible ad opportunities. Since
the equilibrium condition is an inequality when $b_a = b_{\max}$, we can only determine a
lower bound of the values $V_a$ for these advertisers.
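
The case analysis above can be summarized by a small Python helper. It is a sketch under stated assumptions, not part of the original method description: the derivative estimates are taken as given, clicks are assumed to increase with the bid ($\partial Y_a / \partial b_a > 0$), and the function name, return labels, and tolerance are hypothetical.

```python
def estimate_click_value(dY_db, dZ_db, bid, b_max, tol=1e-9):
    """Read the advertiser value V_a off the Kuhn-Tucker conditions (34).

    dY_db, dZ_db: estimated derivatives of expected clicks and expected prices
    with respect to the advertiser's own bid (assumes dY_db > 0 when usable).
    Returns (kind, value) where kind is 'exact', 'lower_bound', 'upper_bound'
    or 'unknown'.
    """
    if abs(dY_db) <= tol:
        # Ads never shown, or bids so high that nothing changes: no information.
        return ("unknown", None)
    ratio = dZ_db / dY_db
    if bid >= b_max:
        # V_a dY/db - dZ/db >= 0 only gives V_a >= ratio.
        return ("lower_bound", ratio)
    if bid <= 0.0:
        # V_a dY/db - dZ/db <= 0 only gives V_a <= ratio.
        return ("upper_bound", ratio)
    # Interior bid: the stationarity condition holds with equality.
    return ("exact", ratio)
```

For an interior bid the estimate is simply the marginal price of a click, that is, the slope of the pricing curve of figure 23 at the observed operating point.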

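Let $\mathcal{A}$ be the set of the active advertisers, that is, the advertisers whose value can be
estimated (or lower bounded) with sufficient accuracy. Assuming that the other advertisers
leave their bids unchanged, we can estimate how the active advertisers adjust their bids in
response to an infinitesimal change $d\theta$ of the scoring model parameters. This is achieved
by differentiating the equilibrium equations (34):

$$ \forall a' \in \mathcal{A} \quad 0 = \left( V_{a'}\, \frac{\partial^2 Y_{a'}}{\partial b_{a'}\, \partial \theta} - \frac{\partial^2 Z_{a'}}{\partial b_{a'}\, \partial \theta} \right) d\theta
+ \sum_{a \in \mathcal{A}} \left( V_{a'}\, \frac{\partial^2 Y_{a'}}{\partial b_{a'}\, \partial b_a} - \frac{\partial^2 Z_{a'}}{\partial b_{a'}\, \partial b_a} \right) db_a . \tag{35} $$

The partial second derivatives must be estimated as described in section 5.3.
Solving the system (35) yields expressions of the form

$$ db_a = \Xi_a\, d\theta . $$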
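
For a scalar parameter $\theta$, system (35) is an ordinary linear system whose unknowns are the responses $\Xi_a$. The Python sketch below solves it by Gaussian elimination and then combines the responses with the direct effect of the intervention. The inputs `hessian` (the coefficients multiplying $db_a$), `cross` (the coefficients multiplying $d\theta$), and all function names are hypothetical; in practice these quantities would be noisy estimates.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: the equilibrium response is undetermined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def bid_responses(hessian, cross):
    """Responses Xi_a such that cross[a'] + sum_a hessian[a'][a] * Xi_a = 0."""
    return solve_linear(hessian, [-c for c in cross])


def equilibrium_total_derivative(dY_dtheta, dY_db, xi):
    """Direct effect of d(theta) plus the effect mediated by the bid adjustments."""
    return dY_dtheta + sum(x * d for x, d in zip(xi, dY_db))


# Usage with two active advertisers and made-up coefficients.
hessian = [[-2.0, 0.5], [0.5, -1.0]]
cross = [1.0, 0.0]
xi = bid_responses(hessian, cross)
total = equilibrium_total_derivative(1.0, [0.7, 0.35], xi)
```

With many active advertisers one would normally rely on a numerical linear algebra library; the explicit elimination only keeps the sketch self-contained.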



We can then estimate how any counterfactual expectation $Y$ of interest changes when the
publisher applies an infinitesimal change $d\theta$ to the scoring parameter $\theta$ and the active
advertisers $\mathcal{A}$ rationally adjust their bids $b_a$ in response:

$$ dY = \left( \frac{\partial Y}{\partial \theta} + \sum_{a \in \mathcal{A}} \Xi_a\, \frac{\partial Y}{\partial b_a} \right) d\theta . \tag{36} $$

Such derivatives can of course drive learning algorithms (section 6).

Although we can only estimate the reaction of the active advertisers $\mathcal{A}$, expression (36)
provides a useful and well characterized point of reference. We know that it does not include
the potentially positive reaction of advertisers who did not bid but could have. We also
know that advertisers placing unrealistically high bids are modeled pessimistically because
we can only estimate a lower bound of their values.

To alleviate these issues, we could alter the auction mechanism in ways that force these
advertisers to reveal more information. We could also design experiments revealing the
impact of the fixed costs incurred by advertisers participating in new auctions. Although
additional work is needed to design such refinements, the quasi-static approach provides a
generic framework to take such aspects into account.

7.3 Dealing with multiple feedback loops

Using the quasi-static methodology familiar to physicists, we have described how to estimate
the derivatives of counterfactual expectations that describe the system performance when
it reaches an equilibrium induced by a causal feedback loop. A natural extension of this
approach handles multiple simultaneous feedback loops: we simply write the derivatives of
all the equilibrium equations and solve the resulting linear system. This flexibility provides
countless refinement opportunities.

8. Conclusion

Using the ad placement example, this work demonstrates the central role of causal inference
(Pearl, 2000; Spirtes et al., 1993) for the design of learning systems interacting with their
environment. Thanks to importance sampling techniques, data collected during randomized
experiments gives precious cues to assist the designer of such learning systems and useful
signals to drive learning algorithms.

Two recurrent themes structure this work. First, thanks to a sharp distinction between
the learning algorithms and the extraction of the signals that drive them, these methods are
applied to causal models with different structures, offering, for instance, a fresh viewpoint on
known reinforcement learning algorithms. Second, maybe unsurprisingly, the mathematical
and philosophical tools developed for the analysis of physical systems appear very effective
for the analysis of causal information systems and of their equilibria. With such themes, this
work is also a vindication of cybernetics (Wiener, 1948).

Appendix

A.1 Greedy Ad Placement Algorithms

Section 2.1 describes how to select and place ads on a web page by maximizing the total
rank-score (1). Following (Varian, 2007; Edelman et al., 2007), we assume that the click
probability estimates are expressed as the product of a positive position term $\gamma_p$ and a
positive ad term $\beta_i(x)$. The rank-scores can therefore be written as $r_{i,p}(x) = \gamma_p\, b_i\, \beta_i(x)$. We
also assume that the policy constraints simply state that a web page should not display
more than one ad belonging to any given advertiser. The discrete maximization problem is
then amenable to computationally efficient greedy algorithms.

Let us fix a layout $L$ and focus on the inner maximization problem. Without loss of
generality, we can renumber the positions such that

$$ L = \{1, 2, \dots, N\} \quad \text{and} \quad \gamma_1 \ge \gamma_2 \ge \dots \ge \gamma_N , $$

and write the inner maximization problem as

$$ \max_{i_1, \dots, i_N}\; \mathcal{R}_L(i_1, \dots, i_N) = \sum_{p \in L} r_{i_p, p}(x) $$

subject to the policy constraints and reserve constraints $r_{i,p}(x) \ge R_p(x)$.

Let $s_i$ denote the advertiser owning ad $i$. The set of ads is then partitioned into subsets
$\mathcal{I}_s = \{ i : s_i = s \}$ gathering the ads belonging to the same advertiser $s$. The ads that
maximize the product $b_i\, \beta_i(x)$ within set $\mathcal{I}_s$ are called the best ads for advertiser $s$. If the
solution of the discrete maximization problem contains one ad belonging to advertiser $s$,
then it is easy to see that this ad must be one of the best ads for advertiser $s$: were it not
the case, replacing the offending ad by one of the best ads would yield a higher $\mathcal{R}_L$ without
violating any of the constraints. It is also easy to see that one could select any of the best
ads for advertiser $s$ without changing $\mathcal{R}_L$.

Let the set $\mathcal{I}^*$ contain exactly one ad per advertiser, arbitrarily chosen among the best
ads for this advertiser. The inner maximization problem can then be simplified as:

$$ \max_{i_1, \dots, i_N \in \mathcal{I}^*}\; \mathcal{R}_L(i_1, \dots, i_N) = \sum_{p \in L} \gamma_p\, b_{i_p}\, \beta_{i_p}(x) $$

where all the indices $i_1, \dots, i_N$ are distinct, and subject to the reserve constraints.

Assume that this maximization problem has a solution $i_1, \dots, i_N$, meaning that there is
a feasible ad placement solution for the layout $L$. For $k = 1 \dots N$, let us define $\mathcal{I}^*_k \subset \mathcal{I}^*$ as

$$ \mathcal{I}^*_k = \mathop{\mathrm{ArgMax}}_{i \,\in\, \mathcal{I}^* \setminus \{i_1, \dots, i_{k-1}\}} b_i\, \beta_i(x) . $$

It is easy to see that $\mathcal{I}^*_k$ intersects $\{i_k, \dots, i_N\}$ because, were it not the case, replacing $i_k$ by
any element of $\mathcal{I}^*_k$ would increase $\mathcal{R}_L$ without violating any of the constraints. Furthermore
it is easy to see that $i_k \in \mathcal{I}^*_k$ because, were it not the case, there would be $h > k$ such that
$i_h \in \mathcal{I}^*_k$, and swapping $i_k$ and $i_h$ would increase $\mathcal{R}_L$ without violating any of the constraints.

Therefore, if the inner maximization problem admits a solution, we can compute a
solution by recursively picking $i_1, \dots, i_N$ from $\mathcal{I}^*_1, \mathcal{I}^*_2, \dots, \mathcal{I}^*_N$. This can be done efficiently
by first sorting the $b_i\, \beta_i(x)$ in decreasing order, and then greedily assigning ads to the best
positions subject to the reserve constraints. This operation has to be repeated for all possible
layouts, including of course the empty layout.

The same analysis can be carried out for click prediction estimates expressed as an arbitrary
monotone combination of a position term $\gamma_p(x)$ and an ad term $\beta_i(x)$, as shown, for instance,
by Graepel et al. (2010).
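
The greedy procedure for the product form $r_{i,p}(x) = \gamma_p\, b_i\, \beta_i(x)$ can be sketched in Python as follows. This is an illustrative reading of the argument above rather than production code: the tuple layout of `ads`, the representation of a layout as a pair of lists, and the function names are assumptions, and the position terms of each layout are assumed to be sorted in decreasing order.

```python
def place_ads(ads, gammas, reserves):
    """Greedy solution of the inner maximization problem for one layout.

    ads: list of (ad_id, advertiser, bid, beta) tuples for the current query.
    gammas: position terms of the layout, sorted in decreasing order.
    reserves: reserve rank-score for each position of the layout.
    Returns (placement, total_rank_score) or None when the layout is infeasible.
    """
    # Policy constraint: keep one best ad (largest bid * beta) per advertiser.
    best = {}
    for ad_id, advertiser, bid, beta in ads:
        score = bid * beta
        if advertiser not in best or score > best[advertiser][1]:
            best[advertiser] = (ad_id, score)
    ranked = sorted(best.values(), key=lambda item: item[1], reverse=True)
    if len(ranked) < len(gammas):
        return None  # not enough distinct advertisers to fill the layout
    placement, total = [], 0.0
    for (ad_id, score), gamma, reserve in zip(ranked, gammas, reserves):
        rank_score = gamma * score
        if rank_score < reserve:
            return None  # reserve constraint violated: layout infeasible
        placement.append(ad_id)
        total += rank_score
    return placement, total


def best_layout(ads, layouts):
    """Repeat the greedy placement over all layouts, including the empty one."""
    best = None
    for gammas, reserves in layouts:
        result = place_ads(ads, gammas, reserves)
        if result is not None and (best is None or result[1] > best[1]):
            best = result
    return best


# Usage with made-up ads: advertiser A owns two ads, only its best one competes.
ads = [("a1", "A", 2.0, 0.1), ("a2", "A", 1.0, 0.3),
       ("b1", "B", 4.0, 0.1), ("c1", "C", 1.0, 0.1)]
two_slots = place_ads(ads, [1.0, 0.5], [0.3, 0.1])
```

Sorting dominates the cost, so each layout is handled in $O(n \log n)$ time for $n$ eligible ads.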

A.2 Confidence Intervals

Section 4.4 explains how to obtain improved confidence intervals by replacing the unbiased 
importance sampling estimator (9) by the capped importance sampling estimator (11). This 
appendix provides details that could have obscured the main message. 

A.2.1 Outer confidence interval

We first address the computation of the outer confidence interval (12) which describes how
the estimator $\widehat{Y}^*$ approaches the capped expectation $\bar{Y}^*$.

$$ \bar{Y}^* = \int_\omega \ell(\omega)\, \bar{w}(\omega)\, P(\omega) \;\approx\; \widehat{Y}^* = \frac{1}{n} \sum_{i=1}^n \ell(\omega_i)\, \bar{w}(\omega_i) . $$

Since the samples $\ell(\omega_i)\, \bar{w}(\omega_i)$ are independent and identically distributed, the central limit
theorem (e.g., Cramér, 1946, section 17.4) states that the empirical average $\widehat{Y}^*$ is asymptotically
normal with mean $\bar{Y}^* = E[\ell(\omega)\, \bar{w}(\omega)]$ and variance $V/n$, where $V = \mathrm{var}[\ell(\omega)\, \bar{w}(\omega)]$.
Since this convergence usually occurs quickly, it is widely accepted to write

$$ P\{\, \widehat{Y}^* - \epsilon_R \le \bar{Y}^* \le \widehat{Y}^* + \epsilon_R \,\} \ge 1 - \delta , $$

with

$$ \epsilon_R = \mathrm{erf}^{-1}(1 - \delta)\, \sqrt{2 V / n} , \tag{37} $$

and to estimate the variance $V$ using the sample variance $\widehat{V}$

$$ V \approx \widehat{V} = \frac{1}{n-1} \sum_{i=1}^n \left( \ell(\omega_i)\, \bar{w}(\omega_i) - \widehat{Y}^* \right)^2 . $$

This approach works well when the ratio ceiling $R$ is relatively small. However the presence
of a few very large ratios makes the variance estimation noisy and might slow down the
central limit convergence.
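
A minimal Python sketch of this normal-approximation interval is given below. It assumes that the capped weights have already been computed as described in section 4.4 and are passed in directly; the function name and argument names are hypothetical. The quantile $\mathrm{erf}^{-1}(1-\delta)\sqrt{2}$ is obtained from the standard normal inverse cumulative distribution function evaluated at $1 - \delta/2$, which is the same number.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def clt_outer_interval(losses, capped_weights, delta=0.05):
    """Normal-approximation interval for the capped expectation.

    losses: observed values l(w_i); capped_weights: capped importance weights.
    Returns (lower, upper), valid with probability about 1 - delta when the
    central limit approximation is accurate.
    """
    products = [l * w for l, w in zip(losses, capped_weights)]
    n = len(products)
    y_hat = mean(products)          # empirical average of l * capped weight
    v_hat = variance(products)      # sample variance with 1 / (n - 1) scaling
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)  # equals sqrt(2) * erfinv(1 - delta)
    eps = z * sqrt(v_hat / n)
    return y_hat - eps, y_hat + eps
```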

The first remedy is to bound the variance more rigorously. For instance, the following 
bound results from (Maurer and Pontil, 2009, theorem 10). 

Combining this bound with (37) gives a confidence interval valid with probability greater
than $1 - 2\delta$. Although this approach eliminates the potential problems related to the
variance estimation, it does not address the potentially slow convergence of the central
limit theorem.

The next remedy is to rely on empirical Bernstein bounds to derive rigorous confidence 
intervals that leverage both the sample mean and the sample variance (Audibert et al., 
2007; Maurer and Pontil, 2009). 

Theorem 1 (Empirical Bernstein bound) (Maurer and Pontil, 2009, thm 4)
Let $X, X_1, X_2, \dots, X_n$ be i.i.d. random variables with values in $[a, b]$ and let $\delta > 0$. Then,
with probability at least $1 - \delta$,

$$ E[X] - M_n \le \sqrt{\frac{2 V_n \log(2/\delta)}{n}} + (b - a)\, \frac{7 \log(2/\delta)}{3(n-1)} , $$

where $M_n$ and $V_n$ respectively are the sample mean and variance

$$ M_n = \frac{1}{n} \sum_{i=1}^n X_i , \qquad V_n = \frac{1}{n-1} \sum_{i=1}^n (X_i - M_n)^2 . $$

Applying this theorem to both $\ell(\omega_i)\, \bar{w}(\omega_i)$ and $-\ell(\omega_i)\, \bar{w}(\omega_i)$ provides confidence intervals
that hold for the worst possible distribution of the variables $\ell(\omega)$ and $\bar{w}(\omega)$.

$$ P\{\, \widehat{Y}^* - \epsilon_R \le \bar{Y}^* \le \widehat{Y}^* + \epsilon_R \,\} \ge 1 - 2\delta $$

where

$$ \epsilon_R = \sqrt{\frac{2 \widehat{V} \log(2/\delta)}{n}} + M R\, \frac{7 \log(2/\delta)}{3(n-1)} . \tag{38} $$

Because they hold for the worst possible distribution, confidence intervals obtained in
this way are less tight than confidence intervals based on the central limit theorem. On
the other hand, thanks to the Bernstein bound, they remain reasonably competitive, and
they provide a much stronger guarantee.

A.2.2 Inner confidence interval

Inner confidence intervals are derived from inequality (14) which bounds the difference
between the counterfactual expectation $Y^*$ and the capped expectation $\bar{Y}^*$:

$$ 0 \le Y^* - \bar{Y}^* \le M\, (1 - \bar{W}^*) . $$

The constant $M$ is defined by assumption (10). The first step of the derivation consists in
obtaining a lower bound of $\bar{W}^* - \widehat{W}^*$ using either the central limit theorem or an empirical
Bernstein bound.

For instance, applying theorem 1 to $-\bar{w}(\omega_i)$ yields

$$ P\left\{\, \bar{W}^* \ge \widehat{W}^* - \sqrt{\frac{2 \widehat{V}_w \log(2/\delta)}{n}} - R\, \frac{7 \log(2/\delta)}{3(n-1)} \,\right\} \ge 1 - \delta $$

where $\widehat{V}_w$ is the sample variance of the capped weights

$$ \widehat{V}_w = \frac{1}{n-1} \sum_{i=1}^n \left( \bar{w}(\omega_i) - \widehat{W}^* \right)^2 . $$

Replacing in inequality (14) gives the inner confidence interval

$$ P\{\, \bar{Y}^* \le Y^* \le \bar{Y}^* + M\, (1 - \widehat{W}^* + \xi_R) \,\} \ge 1 - \delta , $$

with

$$ \xi_R = \sqrt{\frac{2 \widehat{V}_w \log(2/\delta)}{n}} + R\, \frac{7 \log(2/\delta)}{3(n-1)} . \tag{39} $$

Note that $1 - \widehat{W}^* + \xi_R$ can occasionally be negative. This occurs in the unlucky cases where
the confidence interval is violated, with probability smaller than $\delta$.

Putting together the inner and outer confidence intervals,

$$ P\{\, \widehat{Y}^* - \epsilon_R \le Y^* \le \widehat{Y}^* + M\, (1 - \widehat{W}^* + \xi_R) + \epsilon_R \,\} \ge 1 - 3\delta , \tag{40} $$

with $\epsilon_R$ and $\xi_R$ computed as described in expressions (38) and (39).
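
The combined interval (40) is straightforward to compute once the capped weights are available. The Python sketch below assumes losses in $[0, M]$ and capped weights in $[0, R]$, as in the text, and applies the empirical Bernstein radius of theorem 1 twice. Function and argument names are hypothetical, and the capped weights are taken as inputs rather than recomputed.

```python
from math import log, sqrt
from statistics import mean, variance


def bernstein_radius(samples, value_range, delta):
    """Empirical Bernstein deviation for i.i.d. samples spanning value_range."""
    n = len(samples)
    log_term = log(2.0 / delta)
    return (sqrt(2.0 * variance(samples) * log_term / n)
            + value_range * 7.0 * log_term / (3.0 * (n - 1)))


def counterfactual_interval(losses, capped_weights, loss_max, cap, delta):
    """Interval for the counterfactual expectation, valid with probability 1 - 3 delta.

    losses: values in [0, loss_max]; capped_weights: values in [0, cap].
    """
    products = [l * w for l, w in zip(losses, capped_weights)]
    y_hat = mean(products)        # estimate of the capped expectation
    w_hat = mean(capped_weights)  # estimate of the expected capped weight
    eps = bernstein_radius(products, loss_max * cap, delta)  # outer radius
    xi = bernstein_radius(capped_weights, cap, delta)        # inner correction
    lower = y_hat - eps
    upper = y_hat + loss_max * (1.0 - w_hat + xi) + eps
    return lower, upper
```

A wide gap between `upper` and `y_hat + eps` signals that the capping discards a substantial part of the counterfactual distribution, which calls for more exploration rather than more samples.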

A.3 Using Invariant Variables and Predictors

This appendix describes useful techniques leveraging invariant variables and invariant
predictors to construct better confidence intervals.

We use the notation $\upsilon$ to represent the variables of the structural equation model that
are left unchanged by the intervention under consideration. Such variables satisfy the
relations $P^*(\upsilon) = P(\upsilon)$ and $P^*(\omega) = P^*(\omega \setminus \upsilon \mid \upsilon)\, P(\upsilon)$, where we use notation $\omega \setminus \upsilon$ to denote
all remaining variables in the structural equation model. An invariant predictor is then a
function $\zeta(\upsilon)$ that is believed to be a good predictor of $\ell(\omega)$. In particular, it is expected
that $\mathrm{var}[\ell(\omega) - \zeta(\upsilon)]$ is smaller than $\mathrm{var}[\ell(\omega)]$.

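A.3.1 Inner confidence interval with dependent bounds

We first describe how to construct finer inner confidence intervals by using more refined
bounds on $\ell(\omega)$. In particular, instead of the simple bound (10), we can use bounds that
depend on invariant variables:

$$ m \le m(\upsilon) \le \ell(\omega) \le M(\upsilon) \le M . $$

The key observation is the equality

$$ E[\, w^*(\omega) \mid \upsilon \,] = \int_{\omega \setminus \upsilon} w^*(\omega)\, P(\omega \setminus \upsilon \mid \upsilon)
= \int_{\omega \setminus \upsilon} \frac{P^*(\omega \setminus \upsilon \mid \upsilon)\, P(\upsilon)}{P(\omega \setminus \upsilon \mid \upsilon)\, P(\upsilon)}\, P(\omega \setminus \upsilon \mid \upsilon) = 1 . $$

We can then write

$$ Y^* - \bar{Y}^* = \int_\omega \left[ w^*(\omega) - \bar{w}^*(\omega) \right] \ell(\omega)\, P(\omega)
\le \int_\upsilon E[\, w^*(\omega) - \bar{w}^*(\omega) \mid \upsilon \,]\, M(\upsilon)\, P(\upsilon) $$

$$ = \int_\upsilon \left( 1 - E[\, \bar{w}^*(\omega) \mid \upsilon \,] \right) M(\upsilon)\, P(\upsilon)
= \int_\omega \left( 1 - \bar{w}^*(\omega) \right) M(\upsilon)\, P(\omega) = B_{hi} . $$

Using a similar derivation for the lower bound $B_{lo}$, we obtain the inequality

$$ B_{lo} \le Y^* - \bar{Y}^* \le B_{hi} . $$

With the notations

$$ \widehat{B}_{lo} = \frac{1}{n} \sum_{i=1}^n \left( 1 - \bar{w}^*(\omega_i) \right) m(\upsilon_i) , \qquad
\widehat{B}_{hi} = \frac{1}{n} \sum_{i=1}^n \left( 1 - \bar{w}^*(\omega_i) \right) M(\upsilon_i) , $$

$$ \widehat{V}_{lo} = \frac{1}{n-1} \sum_{i=1}^n \left[ \left( 1 - \bar{w}^*(\omega_i) \right) m(\upsilon_i) - \widehat{B}_{lo} \right]^2 , \qquad
\widehat{V}_{hi} = \frac{1}{n-1} \sum_{i=1}^n \left[ \left( 1 - \bar{w}^*(\omega_i) \right) M(\upsilon_i) - \widehat{B}_{hi} \right]^2 , $$

$$ \xi_{lo} = \sqrt{\frac{2 \widehat{V}_{lo} \log(2/\delta)}{n}} + |m|\, R\, \frac{7 \log(2/\delta)}{3(n-1)} , \qquad
\xi_{hi} = \sqrt{\frac{2 \widehat{V}_{hi} \log(2/\delta)}{n}} + |M|\, R\, \frac{7 \log(2/\delta)}{3(n-1)} , $$

two applications of theorem 1 give the inner confidence interval:

$$ P\{\, \bar{Y}^* + \widehat{B}_{lo} - \xi_{lo} \le Y^* \le \bar{Y}^* + \widehat{B}_{hi} + \xi_{hi} \,\} \ge 1 - 2\delta . $$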

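A.3.2 Confidence Intervals for Counterfactual Differences

We now describe how to leverage invariant predictors in order to construct tighter confidence
intervals for the difference of two counterfactual expectations (section 5.2):

$$ Y^+ - Y^* \approx \frac{1}{n} \sum_{i=1}^n \left[ \ell(\omega_i) - \zeta(\upsilon_i) \right] \Delta w(\omega_i)
\quad \text{with} \quad \Delta w(\omega) = \frac{P^+(\omega) - P^*(\omega)}{P(\omega)} . $$

Let us define the reweighting ratios $w^+(\omega) = P^+(\omega)/P(\omega)$ and $w^*(\omega) = P^*(\omega)/P(\omega)$, their
capped variants $\bar{w}^+(\omega)$ and $\bar{w}^*(\omega)$, and the capped centered expectations

$$ \bar{Y}^+_c = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \bar{w}^+(\omega)\, P(\omega)
\quad \text{and} \quad
\bar{Y}^*_c = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \bar{w}^*(\omega)\, P(\omega) . $$

The outer confidence interval is obtained by applying the techniques of section A.2.1 to

$$ \bar{Y}^+_c - \bar{Y}^*_c = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \left[ \bar{w}^+(\omega) - \bar{w}^*(\omega) \right] P(\omega) . $$

Since the weights $\bar{w}^+ - \bar{w}^*$ can be positive or negative, adding or removing a constant
to $\ell(\omega)$ can considerably change the variance of the outer confidence interval. This means
that one should always use a predictor. Even a constant predictor can vastly improve the
outer confidence interval difference.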
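
The effect of a constant predictor is easy to reproduce on made-up numbers. The Python sketch below, whose function name and data are hypothetical, computes the difference estimate and the variance of its empirical average with and without centering. The two estimates coincide here only because the sample weight differences sum to zero; in general they agree in expectation, not sample by sample.

```python
from statistics import mean, variance


def difference_estimate(losses, predictions, delta_weights):
    """Estimate of Y+ - Y* with an invariant predictor used for centering.

    losses: observed l(w_i); predictions: predictor values zeta(v_i);
    delta_weights: weight differences (P+ - P*) / P evaluated at each sample.
    Returns the estimate and the estimated variance of that empirical average.
    """
    terms = [(l - z) * dw for l, z, dw in zip(losses, predictions, delta_weights)]
    return mean(terms), variance(terms) / len(terms)


# Losses hover around 10 while the weight differences change sign.
losses = [10.1, 9.9, 10.2, 9.8]
delta_weights = [0.5, -0.5, 0.5, -0.5]
raw_estimate, raw_variance = difference_estimate(losses, [0.0] * 4, delta_weights)
centered_estimate, centered_variance = difference_estimate(losses, [10.0] * 4, delta_weights)
```

Without centering, each term is dominated by the common level of the losses multiplied by a weight difference of alternating sign, which inflates the variance without carrying any information about the difference.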

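The inner confidence interval is then obtained by writing the difference

$$ (Y^+ - Y^*) - (\bar{Y}^+_c - \bar{Y}^*_c) = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \left[ w^+(\omega) - \bar{w}^+(\omega) \right] P(\omega)
- \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \left[ w^*(\omega) - \bar{w}^*(\omega) \right] P(\omega) $$

and bounding both terms by leveraging $\upsilon$-dependent bounds on the integrand:

$$ -M \le -\zeta(\upsilon) \le \ell(\omega) - \zeta(\upsilon) \le M - \zeta(\upsilon) \le M . $$

This can be achieved as shown in section A.3.1.

A.4 Uniform empirical Bernstein bounds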

This appendix reviews the uniform empirical Bernstein bound given by Maurer and Pontil 
(2009) and describes how it can be used to construct the uniform confidence interval (28). 

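The first step is to characterize the size of a family $\mathcal{F}$ of functions mapping a space $\mathcal{X}$
into the interval $[a, b] \subset \mathbb{R}$. Given $n$ points $x = (x_1 \dots x_n) \in \mathcal{X}^n$, the trace $\mathcal{F}(x) \subset \mathbb{R}^n$ is the
set of vectors $(f(x_1), \dots, f(x_n))$ for all functions $f \in \mathcal{F}$.

Definition 2 (Covering numbers, etc.) Given $\epsilon > 0$, the covering number $\mathcal{N}(x, \epsilon, \mathcal{F})$
is the smallest possible cardinality of a subset $C \subset \mathcal{F}(x)$ satisfying the condition

$$ \forall v \in \mathcal{F}(x) \quad \exists c \in C \quad \max_{i = 1 \dots n} |v_i - c_i| \le \epsilon , $$

and the growth function $\mathcal{N}(n, \epsilon, \mathcal{F})$ is

$$ \mathcal{N}(n, \epsilon, \mathcal{F}) = \sup_{x \in \mathcal{X}^n} \mathcal{N}(x, \epsilon, \mathcal{F}) . $$

Thanks to a famous combinatorial lemma (Vapnik and Chervonenkis, 1968, 1971; Sauer,
1972), for many usual parametric families $\mathcal{F}$, the growth function $\mathcal{N}(n, \epsilon, \mathcal{F})$ increases at
most polynomially (see footnote 12) with both $n$ and $1/\epsilon$.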

Theorem 3 (Uniform empirical Bernstein bound) (Maurer and Pontil, 2009, thm 6)
Let $\delta \in (0, 1)$, $n \ge 16$. Let $X, X_1, \dots, X_n$ be i.i.d. random variables with values in $\mathcal{X}$.
Let $\mathcal{F}$ be a set of functions mapping $\mathcal{X}$ into $[a, b] \subset \mathbb{R}$ and let $\mathcal{M}(n) = 10\, \mathcal{N}(2n, 1/n, \mathcal{F})$.
Then with probability at least $1 - \delta$,

$$ \forall f \in \mathcal{F} \quad E[f(X)] - M_n \le \sqrt{\frac{18\, V_n \log(\mathcal{M}(n)/\delta)}{n}} + (b - a)\, \frac{15 \log(\mathcal{M}(n)/\delta)}{n - 1} , $$

where $M_n$ and $V_n$ respectively are the sample mean and variance

$$ M_n = \frac{1}{n} \sum_{i=1}^n f(X_i) , \qquad V_n = \frac{1}{n-1} \sum_{i=1}^n \left( f(X_i) - M_n \right)^2 . $$

The statement of this theorem emphasizes its similarity with the non-uniform empirical
Bernstein bound (theorem 1). Although the constants are less attractive, the uniform bound
still converges to zero when $n$ increases, provided of course that $\mathcal{M}(n) = 10\, \mathcal{N}(2n, 1/n, \mathcal{F})$
grows polynomially with $n$.

Let us then define the family of functions

$$ \mathcal{F} = \{\; f_\theta : \omega \mapsto \ell(\omega)\, \bar{w}^\theta(\omega) \;,\;\; g_\theta : \omega \mapsto \bar{w}^\theta(\omega) \;\} $$

indexed by all the values of the parameter $\theta$ under consideration,
and use the uniform empirical Bernstein bound to derive an outer inequality similar to (38)
and an inner inequality similar to (39). The theorem implies that, with probability $1 - \delta$,
both inequalities are simultaneously true for all values of the parameter $\theta$. The uniform
confidence interval (28) then follows directly.

12. For a simple proof of this fact, slice $[a, b]$ into intervals $S_k$ of maximal width $\epsilon$ and apply the lemma to
the family of indicator functions $(x_i, S_k) \mapsto \mathbb{1}\{ f(x_i) \in S_k \}$.

A.5 Bibliographical Notes

This appendix expands the references discussed in the main text.

A.5.1 Notes for section 2

The presentation of the ad placement problem summarizes experience acquired at Microsoft 
adCenter in 2010 and 2011. Generalized second order auctions for position auctions were 
discussed by Varian (2007) and Edelman et al. (2007). Simpson's paradox (Simpson, 1951) 
has been abundantly discussed, for instance by Pearl (2000). The example of Simpson's 
paradox in ad placement has not been previously published. 

A.5.2 Notes for section 3

Despite burning philosophical debates, the manipulability theory of causation (e.g., von 
Wright, 1971; Woodward, 2005) has gained acceptance in the statistical community. Rubin 
(1986) argues that there is "no causation without manipulation". Viewing causation as a 
reasoning model (Bottou, 2011) gives statistical meaning to causal statements that do not 
lend themselves to manipulation. 

Structural equation models (Wright, 1921) replicate the methods of classical physics 
developed and rationalized during the Age of Enlightenment. Reichenbach (1956) gives a
thorough discussion of causation in physics. Wiener (1948) demonstrates how the methods
of physics can be applied to the treatment of information in general, even when this
information does not describe physical quantities. The discussion of the isolation
assumption in section 3.2 is clearly inspired by this connection.

Pearl (2009) gives a clear and concise review of the connection between structural
equation models and probabilistic models and of its implications for causal inference. Pearl
(2000) contains considerably more details and references. Spirtes et al. (1993) focus on the 
discovery of causal relations. 

A.5.3 Notes for section 4

Many of the estimation techniques described in sections 4 and 5 have known counterparts 
in the special cases of reinforcement learning, contextual bandits, and multi-armed bandits. 

Monte-Carlo methods for reinforcement learning (Sutton and Barto, 1998, chapter 5)
are essentially reweighted counterfactual estimates. Reinforcement learning research
traditionally focuses on control problems with relatively small state spaces and long
sequences of observations. This reduces the need for characterizing exploration with tight
confidence intervals. For instance, Sutton and Barto suggest normalizing the estimator (7)
by $1/\sum_i w(\omega_i)$ instead of $1/n$. This works poorly when parts of the state space
are left unexplored.
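
The difference between the two normalizations can be seen on a toy example. The following is only a sketch under stated assumptions: the function names are hypothetical, `losses[i]` stands for $\ell(\omega_i)$ and `weights[i]` for $w(\omega_i)$, and the data is made up so that the counterfactual distribution places half of its mass on a region the logged data never visits.

```python
def reweighted_estimate(losses, weights):
    """Reweighted estimate normalized by 1/n: (1/n) * sum_i loss_i * w_i."""
    n = len(losses)
    return sum(l * w for l, w in zip(losses, weights)) / n


def self_normalized_estimate(losses, weights):
    """Same sum, normalized by 1 / sum_i w_i instead of 1/n."""
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("all weights are zero: nothing relevant was explored")
    return sum(l * w for l, w in zip(losses, weights)) / total_weight


# Toy data (assumed): every observed sample falls in the explored region,
# where the loss is 1 and the importance weight is 0.5.
losses = [1.0, 1.0, 1.0, 1.0]
weights = [0.5, 0.5, 0.5, 0.5]

plain = reweighted_estimate(losses, weights)
normalized = self_normalized_estimate(losses, weights)
mean_weight = sum(weights) / len(weights)
```

With these numbers the average weight is 0.5 rather than 1, which reveals that half of the counterfactual mass was never observed. The self-normalized estimate divides this signal away and reports the mean loss of the explored region alone.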

A series of papers on the evaluation of contextual bandit algorithms appeared while
our ad placement work was under way. Strehl et al. (2010) describe a capped estimator
for contextual bandits, unfortunately in a setup that subtly assumes the absence of
unknown confounding variables. Li et al. (2011) evaluate contextual bandit policies using
randomization and reweighted estimates.



A.5.4 Notes for section 5

Hesterberg (1988) describes variance reduction methods for importance sampling. These
methods are designed to improve the confidence interval of the normal average by
constructing a sampling distribution that reduces the variance. Our situation is less
favorable because we want to use the same sampling distribution to obtain estimates for
many counterfactual distributions. Dudik et al. (2011) propose a variance reduction method
that extends the invariant predictor method to any predictor whose conditional expectation
$\zeta^*(\upsilon) = \int_{\omega} \zeta(\omega)\, P^*(\omega \mid \upsilon)$ can be computed efficiently.

The policy gradient estimation method has been rediscovered multiple times (Aleksandrov
et al., 1968; Glynn, 1987; Williams, 1992) in slightly different contexts. Variance
reduction with a baseline has been analyzed by Greensmith et al. (2002). 
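
The effect of a baseline can be illustrated on a two-action toy policy. The sketch below is not taken from the cited papers: the sigmoid parameterization, the reward values, and the baseline value 10.5 are all assumptions chosen for illustration, and the baseline is not claimed to be optimal.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def policy_gradient_samples(theta, rewards, baseline, n, rng):
    """Draw n single-sample score-function gradient estimates.

    The policy picks action 1 with probability sigmoid(theta), else action 0.
    Each sample is (reward - baseline) * d/dtheta log pi(action), where
    d/dtheta log pi(action) = action - sigmoid(theta) for this policy.
    """
    p = sigmoid(theta)
    samples = []
    for _ in range(n):
        action = 1 if rng.random() < p else 0
        score = action - p
        samples.append((rewards[action] - baseline) * score)
    return samples


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
theta, rewards = 0.3, (10.0, 11.0)  # hypothetical toy values
p = sigmoid(theta)
# Exact derivative of the expected reward p * r1 + (1 - p) * r0 w.r.t. theta.
exact = p * (1.0 - p) * (rewards[1] - rewards[0])

plain = policy_gradient_samples(theta, rewards, 0.0, 20000, rng)
with_baseline = policy_gradient_samples(theta, rewards, 10.5, 20000, rng)
```

Both sample sets estimate the same gradient, because the score has zero mean and subtracting a constant baseline therefore leaves the expectation unchanged. Only the variance differs, which can be checked by comparing `variance(plain)` with `variance(with_baseline)`.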

A.5.5 Notes for section 6

The study of learning principles using uniform confidence intervals was pioneered by
Vapnik and Chervonenkis (1968) and has been the object of numerous developments in 
statistics and in machine learning. 

Squashing the click probability estimates in ad placement has been described by Lahaie 
and McAfee (2011). The general auction tuning scheme is due to Charles and Chickering 
(2012). 

Both Wald (1945) and Robbins (1952) give an overview of early works on sequential 
design. Robbins formulates the two-armed bandit problem as one of the simplest instances
of a problem involving the explore/exploit trade-off. The k-armed bandit problem has
received an elegant Bayesian solution (Gittins, 1989). Computationally efficient algorithms
achieve regret bounds that grow at the optimal rate (Auer et al., 2002; Audibert et al., 2007;
Seldin et al., 2012) but are often outperformed by algorithms relying on Gittins indices or
on the Thompson sampling heuristic (Chapelle and Li, 2011). Simple heuristics perform
surprisingly well (Vermorel and Mohri, 2005; Kuleshov and Precup, 2010). The design of
computationally efficient algorithms to optimally balance exploration and exploitation in 
arbitrarily complex learning systems is still considered an open problem. 
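
To make the explore/exploit trade-off concrete, the following sketch simulates a Bernoulli bandit with two selection rules: epsilon-greedy, used here as one example of a simple heuristic, and Thompson sampling with Beta(1, 1) priors. The function names, the value of epsilon, and the arm means are assumptions made for illustration; this is not the experimental protocol of the cited papers.

```python
import random


def epsilon_greedy(successes, failures, rng, epsilon=0.1):
    """Pick a random arm with probability epsilon, else the best empirical mean."""
    k = len(successes)
    if rng.random() < epsilon:
        return rng.randrange(k)
    means = [s / (s + f) if s + f > 0 else 0.0
             for s, f in zip(successes, failures)]
    return max(range(k), key=means.__getitem__)


def thompson(successes, failures, rng):
    """Sample each arm's mean from its Beta(1 + s, 1 + f) posterior; pick the largest."""
    draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def run(select, arm_means, steps, seed=0):
    """Simulate a Bernoulli bandit and return the total reward collected."""
    rng = random.Random(seed)
    k = len(arm_means)
    successes, failures = [0] * k, [0] * k
    total = 0
    for _ in range(steps):
        arm = select(successes, failures, rng)
        reward = 1 if rng.random() < arm_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total += reward
    return total
```

Calling, for instance, `run(thompson, (0.2, 0.8), 2000)` and `run(epsilon_greedy, (0.2, 0.8), 2000)` returns the reward each rule accumulates on the same hypothetical two-armed problem, which can be compared with the roughly 1000 that uniformly random play would collect on average.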

A.5.6 Notes for section 7

Although the differential analysis of equilibrium has been used for centuries in classical 
mechanics, we are not aware of previous uses of this technique to deal with long-term 
feedback in learning systems. 